FOREWORD


This document describes the development and field testing of behaviorally
anchored rating scales for evaluating performance of first-term personnel in
nine Military Occupational Specialties (MOS). The research was part of Project
A,  the  Army's  current,  large-scale  manpower  and  personnel  effort  to  improve  the 
selection,  classification,  and  utilization  of  Army  enlisted  personnel.  The 
thrust  for  the  project  came  from  the  practical,  professional,  and  legal  need 
to  validate  the  Armed  Services  Vocational  Aptitude  Battery  (ASVAB — the  current 
U.S. military selection/classification test battery) and other selection variables
as predictors of training and performance.

Project A is being conducted under contract to the Selection and Classification
Technical Area (SCTA) of the Manpower and Personnel Research Laboratory
(MPRL) at the U.S. Army Research Institute for the Behavioral and Social
Sciences  (ARI).  The  portion  of  the  effort  described  herein  is  devoted  to  the 
development  and  validation  of  Army  Selection  and  Classification  Measures,  and 
referred  to  as  "Project  A."  This  research  supports  the  MPRL  and  SCTA  mission 
to improve the Army's capability to select and classify its applicants for enlistment
or reenlistment by ensuring that fair and valid measures are developed
to  evaluate  applicant  potential  based  on  expected  job  performance  and  utility 
to  the  Army. 

Project  A  was  authorized  through  a  Letter,  DCSOPS,  "Army  Research  Project 
to  Validate  the  Predictive  Value  of  the  Armed  Services  Vocational  Aptitude 
Battery,"  effective  19  November  1980;  and  a  Memorandum,  Assistant  Secretary  of 
Defense (MRA&L), "Enlistment Standards," effective 11 September 1980.

In  order  to  ensure  that  Project  A  research  achieves  its  full  scientific 
potential  and  will  be  maximally  useful  to  the  Army,  a  governance  advisory  group 
comprised  of  Army  general  officers;  interservice  scientists;  and  experts  in 
personnel  measurement,  selection,  and  classification  was  established.  Members 
of  the  latter  component  provide  guidance  on  technical  aspects  of  the  research, 
while  general  officer  and  interservice  components  oversee  the  entire  research 
effort; provide military judgment; periodically review research progress, results,
and plans; and coordinate within their commands. Members of the General
Officer's  Advisory  Group  include  MG  Porter  (DMPM)  (Chair),  MG  Briggs  (FORSCOM, 
DCSPER),  MG  Knudson  (DCSOPS),  BG  Franks  (USAREUR,  ADCSOPS),  and  MG  Edmonds 
(TRADOC,  DCS-T).  The  General  Officer's  Advisory  Group  was  briefed  in  May  1985 
on the issue of obtaining proponent concurrence of the criterion measures before
administering the concurrent validation. Members of Project A's Scientific
Advisory Group (SAG), who guide the technical quality of the research,
include  Drs.  Milton  Hakel  (Chair),  Philip  Bobko,  Thomas  Cook,  Lloyd  Humphreys, 
Robert  Linn,  Mary  Tenopyr,  and  Jay  Uhlaner.  The  SAG  was  briefed  in  October 
1984  on  the  results  of  the  Batch  A  field  test  administration.  Further,  the 
SAG  was  briefed  in  March  1985  on  the  contents  of  the  proposed  Trial  Battery. 

A comprehensive set of new selection/classification tests and job performance/training
criteria has been developed and field tested. Results from
the  Project  A  field  tests  and  subsequent  concurrent  validation  will  be  used 

DEVELOPMENT AND FIELD TEST OF BEHAVIORALLY ANCHORED RATING SCALES
FOR  NINE  MOS 


EXECUTIVE  SUMMARY 


Requirement:

Project  A  is  a  large-scale,  multiyear  research  program  intended  to  improve 
the  selection  and  classification  system  for  initial  assignment  of  persons  to 
U.S.  Army  Military  Occupational  Specialties  (MOS).  Specifically,  Project  A  is 
to  validate  new  and  existing  selection  measures  against  both  existing  and 
project-developed criteria.

This report describes the development and field test of behaviorally anchored
rating scales designed for nine MOS. These include Infantryman (11B),
Cannon Crewman (13B), Armor Crewman (19E), Single-Channel Radio Operators
(31C), Light-Wheel Vehicle Mechanics (63B), Motor Transport Operators (64C),
Administrative  Specialists  (71L),  Medical  Specialists  (91A),  and  Military 
Police  (95B). 

Procedure: 

For  each  MOS,  the  behavioral  analysis  method  was  used  to  generate  examples 
of  effective,  average,  and  ineffective  job  performance.  These  examples  were 
used  to  identify  performance  effectiveness  dimensions  and  to  develop  behavioral 
definitions and standards of performance for each dimension. Across the nine
MOS,  behavioral  summary  rating  scales  contained  from  7  to  13  performance 
dimensions. 

These  rating  scales  were  field  tested  in  continental  United  States  and 
overseas  locations.  The  first  (Batch  A)  field  test  focused  on  four  MOS,  and  the 
second  (Batch  B)  field  test  focused  on  five  MOS.  For  each  MOS,  rating  scales 
were  administered  to  120  to  160  first-term  soldiers  and  their  supervisors. 

Findings:

Results  of  the  field  test  were  encouraging.  In  particular,  rating  session 
administrators reported that participants understood and complied with instructions
and found the rating scales useful for evaluating job performance; interrater
reliability estimates were reasonably high; and rating distributions were
acceptable  with  mean  values  slightly  above  the  midpoint. 

Utilization  of  Findings: 

The MOS-specific rating scales will be administered in the Project A Concurrent
Validation study scheduled for Summer 1985. Scores from these scales
along with other scores from other criterion measures will be used to assess
the validity of existing and new selection measures. Information obtained from
the  field  tests  was  used  to  modify,  refine,  and  prepare  the  MOS-specific  rating 
scales  for  the  Concurrent  Validity  study.  Overall,  the  scales  required  very 
few  changes. 

OVERVIEW OF PROJECT A


Project  A  is  a  comprehensive  long-range  research  and  development  program 
which the U.S. Army has undertaken to develop an improved personnel selection
and classification system for enlisted personnel. The Army’s goal is
to increase its effectiveness in matching first-tour enlisted manpower
requirements with available personnel resources, through use of new and
improved selection/classification tests which will validly predict carefully
developed measures of job performance. The project addresses the
675,000-person  enlisted  personnel  system  of  the  Army,  encompassing  several 
hundred  different  military  occupations. 

This  research  program  began  in  1980,  when  the  U.S.  Army  Research  Institute 
(ARI)  started  planning  the  extensive  research  effort  that  would  be  needed 
to  develop  the  desired  system.  In  1982  a  consortium  led  by  the  Human 
Resources Research Organization (HumRRO) and including the American Institutes
for Research (AIR) and the Personnel Decisions Research Institute
(PDRI)  was  selected  by  ARI  to  undertake  the  9-year  project.  The  total 
project  utilizes  the  services  of  40  to  50  ARI  and  consortium  researchers 
working  collegially  in  a  variety  of  specialties,  such  as  industrial  and 
organizational  psychology,  operations  research,  management  science,  and 
computer  science. 

The  specific  objectives  of  Project  A  are  to: 

•  Validate existing selection measures against both existing and
project-developed criteria. The latter are to include both Army-wide
job performance measures based on newly developed rating
scales, and direct hands-on measures of MOS-specific task performance.

•  Develop and validate new selection and classification measures.

•  Validate intermediate criteria, such as performance in training,
as predictors of later criteria, such as job performance ratings,
so that better informed reassignment and promotion decisions can
be made throughout a soldier’s career.

•  Determine the relative utility to the Army of different performance
levels across MOS.

•  Estimate the relative effectiveness of alternative selection and
classification procedures in terms of their validity and utility
for making operational selection and classification decisions.

The research design for the project incorporates three main stages of data
collection and analysis in an iterative progression of development, testing,
evaluation, and further development of selection/classification instruments
(predictors) and measures of job performance (criteria). In the
first iteration, file data from Army accessions in fiscal years (FY) 1981
and 1982 were evaluated to explore the relationships between the scores of
applicants on the Armed Services Vocational Aptitude Battery (ASVAB) and
their subsequent performance in training and their scores on the first-tour
Skill Qualification Tests (SQT).

In  the  second  iteration,  a  concurrent  validation  design  will  be  executed 
with  FY85  accessions.  As  part  of  the  preparation  for  the  Concurrent 
Validation,  a  "preliminary  battery"  of  perceptual,  spatial,  temperament/ 
personality,  interest,  and  biodata  predictor  measures  was  assembled  and 
used  to  test  several  thousand  soldiers  as  they  entered  in  four  Military 
Occupational  Specialties  (MOS)  in  FY83/84.  The  data  from  this  "preliminary 
battery  sample"  along  with  information  from  a  large-scale  literature  review 
and  a  set  of  structured,  expert  judgments  were  then  used  to  identify  "best 
bet"  measures.  These  "best  bet"  measures  were  developed,  pilot  tested,  and 
refined. The refined test battery was then field tested to assess reliabilities,
"fakability," practice effects, and so forth. The resulting predictor
battery, now called the "Trial Battery," which includes computer-administered
perceptual and psychomotor measures, is being administered
together  with  a  comprehensive  set  of  job  performance  indices  based  on  job 
knowledge  tests,  hands-on  job  samples,  and  performance  rating  measures  in 
the  Concurrent  Validation. 

Based partly on the results of the Concurrent Validation, the "Trial Battery"
will be revised to become the "Experimental Predictor Battery" which
in  turn  will  be  administered  as  part  of  the  longitudinal  validation  stage 
beginning  in  the  late  Summer  and  early  Fall  of  1986. 

For  both  the  concurrent  and  longitudinal  validations,  a  sample  of  19  MOS 
were  specially  selected  as  representative  of  the  Army’s  250+  entry-level 
MOS.  The  selection  was  based  on  an  initial  clustering  of  MOS  derived  from 
rated  similarities  of  job  content.  These  19  MOS  account  for  about  45 
percent  of  Army  accessions.  Sample  sizes  are  sufficient  so  that  race  and 
sex fairness can be empirically evaluated in most MOS.

In  the  third  iteration  (the  longitudinal  validation),  all  of  the  measures, 
refined  on  the  basis  of  experience  in  field  testing  and  the  Concurrent 
Validation,  will  be  administered  in  a  true  predictive  validity  design. 

About 50,000 soldiers across 20 MOS will be included in the FY86-87 "Experimental
Predictor Battery" administration and subsequent first-tour
measurement. About 3,500 of these soldiers are expected to be available
for second-tour performance measurement in FY91.

Activities  and  progress  during  the  first  two  years  of  the  project  were 
reported  for  FY83  in  ARI  Research  Report  1347  and  its  Technical  Appendix, 
ARI  Research  Note  83-37,  and  for  FY84  in  ARI  Research  Report  1393  and  its 
related  reports,  ARI  Technical  Report  660  and  ARI  Research  Note  85-14. 

Other  publications  on  specific  activities  during  those  years  are  listed  in 
those  annual  reports.  The  annual  report  on  project-wide  activities  during 
FY85  is  under  preparation. 

For administrative purposes, Project A is divided into five research tasks:

Task  1  --  Validity  Analyses  and  Data  Base  Management 
Task 2 -- Developing Predictors of Job Performance
Task  3  --  Developing  Measures  of  School/Training  Success 
Task  4  --  Developing  Measures  of  Army-Wide  Performance 
Task  5  --  Developing  MOS-Specific  Performance  Measures 


The  development  and  revision  of  the  wide  variety  of  predictor  and  criterion 
measures  reached  the  stage  of  extensive  field  testing  during  FY84  and  the 
first  half  of  FY85.  These  field  tests  resulted  in  the  formulation  of  the 
test  batteries  that  will  be  used  in  the  comprehensive  Concurrent  Validation 
program  which  is  being  initiated  in  FY85. 

The  present  report  is  one  of  five  reports  prepared  under  Tasks  2-5  to 
report  the  development  of  the  measures  and  the  results  of  the  field  tests, 
and  to  describe  the  measures  to  be  used  in  Concurrent  Validation.  The  five 
reports  are: 

Task 2 -- "Development and Field Test of the Trial Battery for Project
          A," Norman G. Peterson, Editor, ARI Technical Report 739,
          May 1987.

Task 3 -- "Development and Field Test of Job-Relevant Knowledge Tests
          for Selected MOS," Robert H. Davis et al., ARI Technical
          Report in preparation.

Task 4 -- "Development and Field Test of Army-Wide Rating Scales and
          the Rater Orientation and Training Program," Elaine D.
          Pulakos and Walter C. Borman, Editors, ARI Technical Report
          716, July 1986.

Task 5 -- "Development and Field Test of Task-Based MOS-Specific
          Criterion Measures," Charlotte H. Campbell et al., ARI
          Technical Report 717, July 1986.

       -- "Development and Field Test of Behaviorally Anchored Rating
          Scales for Nine MOS," Jody L. Toquam et al., ARI Technical
          Report in preparation.


CHAPTER 1: DEVELOPMENT OF BEHAVIORALLY ANCHORED
RATING  SCALES  (BARS) 


Objective 


The U.S. Army is examining the effectiveness of its selection and classification
battery, the Armed Services Vocational Aptitude Battery, in predicting
training and job performance outcomes. As part of Project A, new
predictor measures have been developed to supplement the current military
selection and classification battery. Thus, an important feature of this
project involves developing measures of training outcomes and job performance
that can be used to estimate the validity of the ASVAB and the
incremental  validities  of  the  new  measures.  The  first  wave  of  research 
activities  has  focused  on  first-term  enlistee  training  and  job  performance 
outcomes. 
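
As a rough illustration of what estimating validity and incremental validity can involve, the sketch below correlates a baseline predictor with a criterion and then computes how much the squared multiple correlation rises when a second predictor is added. It is not the project's analysis: the function names and all scores are hypothetical, and the two-predictor R-squared formula is the standard textbook identity rather than anything taken from this report.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def incremental_validity(baseline, new_measure, criterion):
    """Return (R^2 baseline only, R^2 both predictors, gain in R^2)."""
    r_y1 = pearson_r(baseline, criterion)
    r_y2 = pearson_r(new_measure, criterion)
    r_12 = pearson_r(baseline, new_measure)
    r2_baseline = r_y1 ** 2
    # Squared multiple correlation for two predictors, from pairwise correlations.
    r2_both = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return r2_baseline, r2_both, r2_both - r2_baseline


# Hypothetical scores for eight soldiers (illustration only, not project data).
asvab = [45, 52, 60, 48, 70, 66, 55, 58]                 # baseline predictor
spatial = [12, 9, 15, 14, 13, 18, 10, 16]                # assumed new predictor
rating = [4.0, 4.2, 5.5, 4.8, 5.9, 6.4, 4.3, 5.6]        # assumed criterion

r2_base, r2_both, gain = incremental_validity(asvab, spatial, rating)
print(f"R^2 baseline: {r2_base:.3f}  R^2 both: {r2_both:.3f}  gain: {gain:.3f}")
```

The gain can never be negative in this formulation, so the practical question is whether it is large enough to justify adding the new measure.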

Components of first-term enlistee job performance include measures of Army-wide,
or general, soldier effectiveness and measures of occupation-specific
job  requirements.  These  latter  measures  are  the  focus  of  Task  5  of  Project 
A  and  of  this  report. 

There  are  several  ways  to  define  the  performance  domain  and  to  assess 
performance  in  MOS-specific  job  areas.  For  example,  performance  may  be 
defined  by  the  major  or  critical  tasks  comprising  the  job.  Performance  on 
such tasks may be assessed by measures that simulate critical activities of
the job (e.g., hands-on tests), written tests that measure incumbents’
knowledge of the critical components of the job (e.g., job knowledge
tests), or measures that ask persons familiar with target incumbents to
evaluate incumbents’ performance in the task areas, using specially designed
rating scales.

Another means of assessing performance involves identifying broad dimensions
that define the critical job performance requirements. These dimensions
may then be used to develop rating scales that measure performance
effectiveness more broadly than task-oriented assessment instruments. Once
again, persons familiar with target incumbents are asked to evaluate incumbents’
performance, using these rating scales.

For Task 5, both approaches have been used to measure job performance.
That is, instruments assessing performance or knowledge in critical task
areas and assessing performance on broad dimensions have been developed.

In  this  report,  we  document  the  procedures  and  activities  in  developing 
MOS-specific  performance  appraisal  forms  that  assess  job  effectiveness  on 
broad  behavioral  dimensions.  (Documentation  of  development  activities  of 
task-oriented  performance  measures  may  be  found  in  Campbell,  Campbell, 
Rumsey, & Edwards, 1986.)

This report contains three chapters. In Chapter 1, we describe the procedures
used to develop behaviorally anchored performance rating scales, the
sample of participants involved in defining the performance dimensions, and
the resulting performance rating scales. Chapter 2 contains a description
of the procedures used in field testing the newly developed scales, along
with results from the field test. Finally, in Chapter 3, we discuss decisions
concerning rating scale modifications and present the final set of
behaviorally anchored rating scales (BARS) to be used in the Concurrent
Validation administration.


Background 


The  procedure  used  to  identify  MOS-specific  job  duties  was  derived  in  large 
part  from  procedures  outlined  by  Smith  and  Kendall  (1963)  and  by  Campbell, 
Dunnette,  Arvey,  and  Hellervik  (1973).  According  to  Smith  and  Kendall, 
performance appraisal rating scales should emphasize activity or performance
that can be observed on the job. Their recommended procedure involves
identifying behaviors that lead to effective or ineffective job
performance outcomes and avoids focusing on unobservable or nonbehavioral
attributes. Another feature of this methodology involves developing rating
scales that incorporate the language of the users and that reflect standards
which users help to define. Thus, activities to develop rating
scales  include  the  users  in  all  phases  of  scale  construction.  Details  of 
the  development  process  are  described  below. 


Smith  and  Kendall  were  the  first  to  recommend  using  the  critical  incident 
technique  described  by  Flanagan  (1954)  to  identify  the  major  dimensions  or 
categories  of  job  performance.  This  is  accomplished  by  asking  those  most 
familiar  with  the  job--supervisors  and  incumbents--to  describe  or  write 
examples  of  effective,  average,  and  ineffective  behavior  observed  on  the 
job. 


These  authors  recommend  conducting  critical  incident  workshops  that,  as  a 
first  step,  name  and  define  the  major  components  of  performance  for  the  job 
in  question.  Workshop  participants  are  then  asked  to  write  examples  of 
effective  and  ineffective  performance  for  each  of  the  major  components  they 
have  identified. 


Campbell et al. (1973) suggest a slight modification to the Smith and
Kendall  procedure.  They  recommend  that  performance  categories  be  generated 
after  participants  have  had  an  opportunity  to  write  several  incidents.  In 
this  way,  participants  will  not  be  constrained  by  working  with  a  priori 
performance  categories  and  are  more  likely  to  write  performance  examples 
that represent all job requirements. Thus, it is less likely that important
job duties will be overlooked.

The  next  step  involves  editing  the  written  performance  examples  or  critical 
incidents.  Here,  Smith  and  Kendall  emphasize  the  need  for  retaining  the 
"flavor"  of  the  incidents  to  ensure  that  terminology  used  on  the  job  also 
appears  in  the  rating  scales. 

These  edited  incidents  are  then  used  to  identify  the  major  dimensions  of 
the job. Two or more researchers independently content analyze the incidents
and sort them into performance dimensions, and then compare their
results  to  form  a  performance  dimension  system.  Performance  categories 
generated  in  workshop  discussions  may  be  used  to  help  label  and  define  the 
resulting  performance  dimensions. 

Next, supervisors and incumbents are called in to participate in a retranslation
exercise. They are asked to read the performance incidents and
make two ratings for each. First, they must assign each incident to a
performance dimension based on the behavior described in the incident.
Second, raters are asked to indicate the effectiveness level of the behavior.

Results  from  this  exercise  are  used  to  evaluate  the  performance  dimension 
system  to  ensure  that  dimensions  are  clear  and  that  raters  can  effectively 
allocate  behavioral  examples  into  each  with  a  high  level  of  agreement. 
Further,  retranslation  ratings  are  used  to  develop  behavioral  standards 
that  represent  performance  at  various  effectiveness  levels.  The  final 
product is a set of behaviorally defined and anchored performance dimensions
that focus on the duties and standards of a specific job or MOS.
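
The retranslation analysis can be sketched roughly as follows. For one incident, the code finds the dimension most raters chose, the share of raters who agreed on it, and the mean and standard deviation of the effectiveness ratings. The function names, the dimension labels, and the retention cutoffs (`min_agreement`, `max_sd`) are assumptions made for illustration; the text above does not state the criteria actually applied.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_incident(ratings):
    """Summarize retranslation ratings for one incident.

    ratings: list of (dimension, effectiveness) pairs, one pair per rater.
    Requires at least two raters so a standard deviation can be computed.
    """
    dimension_counts = Counter(dimension for dimension, _ in ratings)
    modal_dimension, votes = dimension_counts.most_common(1)[0]
    levels = [level for _, level in ratings]
    return {
        "dimension": modal_dimension,          # dimension chosen most often
        "agreement": votes / len(ratings),     # share of raters choosing it
        "mean": mean(levels),                  # average effectiveness rating
        "sd": stdev(levels),                   # spread of effectiveness ratings
    }


def retain(summary, min_agreement=0.5, max_sd=2.0):
    """Keep an incident only if raters agree on its dimension and level.

    Both cutoffs are hypothetical placeholders, not values from the report.
    """
    return summary["agreement"] >= min_agreement and summary["sd"] <= max_sd


# Hypothetical ratings from four raters for a single incident.
example = [("Safety", 7), ("Safety", 8), ("Safety", 6), ("Maintenance", 7)]
summary = summarize_incident(example)
print(summary, retain(summary))
```

Incidents that pass a check of this kind are natural candidates for scale anchors, because raters place them in the same dimension and at a similar effectiveness level.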

Guidelines  for  developing  behaviorally  anchored  rating  scales,  established 
by Smith and Kendall (1963) and by Campbell et al. (1973), were used
throughout  the  conduct  of  this  part  of  Task  5.  In  the  next  section  we 
describe  in  detail  the  development  of  behaviorally  anchored  rating  scales 
for  first-term  enlistees. 


Method 


Target  Military  Occupational  Specialties  (MOS) 


As  noted,  the  purpose  of  this  part  of  Task  5  was  to  develop  behaviorally 
anchored  performance  rating  scales  that  highlight  specific  job  requirements 
for  nine  MOS.  The  pool  of  MOS  that  had  been  selected  for  inclusion  in 
Project  A  comprised  19  specialties  identified  as  representative  of  the  more 
than  200  enlisted  occupations  in  the  Army. 

Very  early  in  the  project  it  was  deemed  infeasible  to  develop  specific  job 
performance measurement instruments for all of the selected MOS. Therefore,
a subset comprised of nine occupational specialties was selected for
developing MOS-specific performance measures. These MOS were chosen on the
basis of the total number of persons in each and the type of work performed.
The objective was to identify MOS that have fairly large numbers
and  that  represent  different  primary  missions  (i.e.,  combat  arms,  combat 
support,  noncombat).  The  nine  MOS  selected  are: 

11B  Infantryman
13B  Cannon Crewman
19E  Armor Crewman
31C  Radio Teletype Operator (Originally coded 05C)
63B  Light-Wheel Vehicle Mechanic
64C  Motor Transport Operator
71L  Administrative Specialist
91A  Medical Specialist (Originally coded 91B)
95B  Military Police


First, the nine MOS were divided into two groups or batches, Batch A and
Batch B. The MOS in the first group (Batch A) are 13B, 64C, 71L, and 95B;
those included in the second group (Batch B) are 11B, 19E, 31C, 63B, and
91A. Dividing the nine MOS into two groups made it possible to design and
use data collection procedures for the first group, develop performance
rating scales, and try them out in the field. Before beginning work on the
second batch, we evaluated our procedures and modified them to improve and
streamline the scale development process. For the most part, the procedures
employed for the Batch A MOS are very similar to those used to
develop scales for Batch B MOS. Where procedures differed for the two
batches, we describe the differences and the rationale for the modifications.

Each  of  the  nine  MOS  was  assigned  to  a  PDRI  research  staff  member,  who  was 
responsible  for  (1)  conducting  workshops  to  collect  performance  incidents 
for  the  assigned  MOS,  (2)  editing  incidents,  (3)  preparing  retranslation 
exercises,  (4)  developing  performance  rating  scales,  and  (5)  revising  the 
scales  for  the  Concurrent  Validation  efforts.  Thus,  a  single  researcher 
became  an  "expert"  concerning  the  job  duties  and  requirements  involved  in 
the  assigned  MOS. 

Please note that we have prepared nine appendices that correspond to the
nine MOS included in the project. These are located in a separate report,
ARI Research Note _ , 1985 (four volumes). They appear in the following
order: Appendix A - 13B Cannon Crewman; Appendix B - 64C Motor Transport
Operator; Appendix C - 71L Administrative Specialist; Appendix D - 95B
Military Police; Appendix E - 11B Infantryman; Appendix F - 19E Armor
Crewman; Appendix G - 31C Radio Teletype Operator; Appendix H - 63B
Light-Wheel Vehicle Mechanic; and Appendix I - 91A Medical Specialist.


Sample 

We  modified  the  procedures  somewhat  from  those  described  by  Smith  and 
Kendall (1963) and Campbell et al. (1973). For example, incumbents or
first-term enlistees from target MOS were not, as a rule, included in the
workshops. We reasoned here that first-termers, especially those who had
been  in  the  Army  for  only  a  year  or  two,  would  not  have  had  the  opportunity 
to  obtain  the  "big  picture"  of  MOS-specific  job  requirements.  Therefore, 
to  ensure  that  workshop  participants  were  familiar  with  first-term  enlistee 
job  requirements,  most  individuals  selected  to  participate  in  the  workshops 
were  non-commissioned  officers  (NCOs)  directly  responsible  for  supervising 
first-term  enlistees  and  hence  were  equivalent  to  first-line  supervisors. 
Further,  most  of  the  NCOs  included  in  the  sample  had  spent  two  to  four 
years  as  first-termers  in  these  MOS,  and  therefore  were  familiar  with  the 
job  requirements  from  an  "incumbent"  as  well  as  a  "supervisor"  perspective. 

To  ensure  thorough  coverage  and  representation  of  the  critical  behaviors  in 
each  MOS,  workshops  for  each  MOS  were  conducted  at  six  CONUS  (Continental 
United  States)  Army  posts.  Posts  included  in  Batch  A  workshops  were  Fort 
Ord,  California;  Fort  Polk,  Louisiana;  Fort  Bragg,  North  Carolina;  Fort 
Campbell,  Kentucky;  Fort  Hood,  Texas;  and  Fort  Carson,  Colorado.  Those 
scheduled  for  Batch  B  workshops  were  Fort  Lewis,  Washington;  Fort  Stewart, 
Georgia;  Fort  Riley,  Kansas;  Fort  Bragg,  North  Carolina;  Fort  Sill, 
Oklahoma;  and  Fort  Bliss,  Texas.  The  workshop  schedule  for  collecting 
performance  incidents  at  each  of  these  sites  is  provided  in  Table  1. 

Table 1

Workshop Locations and Dates

Location           Dates

Batch A
  Fort Ord         25 - 26 August 1983
  Fort Polk        29 - 30 August 1983
  Fort Bragg       12 - 13 September 1983
  Fort Campbell    15 - 16 September 1983
  Fort Hood        13 - 14 October 1983
  Fort Carson      31 October - 1 November 1983

Batch B
  Fort Lewis        9 - 11 January 1984
  Fort Stewart     11 - 13 January 1984
  Fort Riley       16 - 18 January 1984
  Fort Bragg       27 - 29 February 1984
  Fort Bliss       12 - 14 March 1984
  Fort Sill        14 - 16 March 1984


At each Army post, our point-of-contact (POC) was asked to obtain from 10
to 16 NCOs from each target MOS. Thus, the goal was to obtain input from
about 60 to 96 supervisors for each MOS. The total numbers of NCOs participating
in the performance incident workshops by MOS were as follows:
13B--N=88; 64C--N=81; 71L--N=63; 95B--N=86; 11B--N=83; 19E--N=65;
31C--N=60; 63B--N=75; and 91A--N=71.
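
The sampling goal follows from simple arithmetic: six posts per MOS, each asked for 10 to 16 NCOs. The short sketch below recomputes that target range and checks the reported participant counts against it. The counts are those given above; the variable names are illustrative only.

```python
# Six workshop posts per MOS, each asked to supply 10 to 16 NCOs.
POSTS_PER_MOS = 6
REQUESTED_PER_POST = (10, 16)

# Participant counts by MOS as reported in the text above.
participants = {
    "13B": 88, "64C": 81, "71L": 63, "95B": 86, "11B": 83,
    "19E": 65, "31C": 60, "63B": 75, "91A": 71,
}

target_low = POSTS_PER_MOS * REQUESTED_PER_POST[0]
target_high = POSTS_PER_MOS * REQUESTED_PER_POST[1]

# Flag whether each MOS sample falls inside the target range.
in_range = {mos: target_low <= n <= target_high for mos, n in participants.items()}
total = sum(participants.values())

print(f"Target per MOS: {target_low} to {target_high}")
print(f"All MOS within target: {all(in_range.values())}; total participants: {total}")
```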

A  breakdown  of  each  MOS  workshop  sample  by  rank  and  by  gender  is  provided 
in  Tables  2  and  3  for  Batch  A  and  Batch  B  MOS.  For  one  MOS  the  total  number 
of  participants  reported  by  rank  does  not  equal  the  total  reported  above, 
because  a  few  participants  did  not  report  their  rank.  It  is  also  important 
to  note  that  for  three  MOS  no  females  participated,  because  these  three 
MOS--13B, 19E, and 11B--involve combat duty, which precludes females from
enlisting  in  them. 

As the information in the tables indicates, the bulk of the workshop samples
consisted of NCOs at the E-5 and E-6 levels. In some cases, however,
participants were enlistees of lower rank, such as E-1 and E-2; these
individuals were first-term enlistees with less than one year of job experience.
Also, some workshop sessions contained NCOs at the E-8 and E-9
level. These individuals have less direct responsibility for supervising
first-term enlistees and can be considered equivalent to second-line supervisors.


Performance  Incident  Data  Collection  Activities 


Workshop Description. We began each workshop session by providing participants
with booklets containing information about Project A and about the
day’s  activities.  We  have  included  the  booklets  used  for  each  MOS  in 
Section  1  of  Appendices  A  through  I. 


The schedule of activities followed for each critical incident workshop for
all MOS is shown in Table 4. Workshop leaders first provided a description
of Project A, then briefed participants on the purpose of the workshop.
This led to discussion of the different types of performance rating scales
available, and the advantages of using behaviorally anchored rating scales
to assess job performance. Leaders then described how the results from the
day’s activities would be used to develop this type of rating scale for
that particular MOS.


Next, workshop leaders provided instruction for writing performance incidents.
This included a description of the information required in each
incident,  such  as  the  setting,  the  behaviors  observed,  and  the  outcome  (or 
what  happened  as  a  result  of  the  behavior).  Participants  were  asked  to 
review  several  examples  in  their  booklets  to  get  an  idea  of  how  to  write 
performance  incidents.  The  examples  of  "bad"  incidents  contained  ir¬ 
relevant  information  or  lacked  important  information,  whereas  the  "good" 
examples  were  corrected  versions  that  contained  all  necessary  information. 


Workshop  leaders  then  distributed  performance  incident  forms  and  asked 
participants  to  generate  performance  incidents,  using  the  examples  as 
guides.  Figure  1  shows  a  sample  form  that  participants  used  to  generate 
incidents. 


Job Described

1.  What were the circumstances leading up to the incident?

2.  What did the individual do that made you feel he or she was a good,
    average, or poor performer?

3.  In what job performance category would you say this incident falls?

4.  Circle the number below that best reflects the correct effectiveness
    level for this example:

         1       2     3     4       5       6     7     8       9

     extremely    ineffective      about       effective     extremely
    ineffective                   average                    effective


Table 2

Performance Incident Workshops:
Rank and Gender of Batch A Participant Sample by MOS


13B - Cannon Crewman

Rank       N      %
E1         0      0
E2         0      0
E3         0      0
E4         2      2
E5        49     55
E6        29     33
E7         7      8
E8         1      1
E9         0      0
Total     88

Gender
M         88    100
F          0      0


71L - Administrative Specialist*

Rank       N      %
E1         0    0.0
E2         1    1.6
E3         3    4.9
E4         0    0.0
E5        27   44.3
E6        10   16.4
E7        12   19.7
E8         7   11.5
E9         1    1.6
Total     61

Gender
M         44   69.8
F         19   30.2


64C - Motor Transport Operator

Rank       N      %
E1         0    0.0
E2         0    0.0
E3         3    3.9
E4         4    5.2
E5        34   44.7
E6        27   35.5
E7         8   10.5
E8         0    0.0
E9         0    0.0
Total     76

Gender
M         74   97.4
F          2    2.6


95B - Military Police

Rank       N      %
E1         0      0
E2         0      0
E3         0      0
E4         0      0
E5        39     45
E6        24     27
E7        16     18
E8         6      6
E9         1      1
Total     86

Gender
M         84     97
F          2      2


*The total sample size by rank does not equal the total sample by gender
because two individuals failed to report their rank.

Table 3 (Continued)

Performance Incident Workshops:
Rank and Gender of Batch B Participant Sample by MOS


63B - Light-Wheel Vehicle Mechanic      91A - Medical Specialist

Rank       N      %                     Rank       N      %
E1         1    1.3                     E1         1    1.4
E2         3    4.0                     E2         2    2.8
E3         4    5.3                     E3         1    1.4
E4         5    6.7                     E4        13   18.3
E5        35   46.7                     E5        26   36.6
E6        20   26.7                     E6        17   23.9
E7         6    8.0                     E7         8   11.3
E8         1    1.3                     E8         3    4.2
E9         0    0.0                     E9         0    0.0
Total     75                            Total     71

Gender                                  Gender
M         72   96.0                     M         54   76.1
F          3    4.0                     F         17   23.9

Table 4

Agenda for Performance Incident Workshop


Time           Topic

0800 - 0815    Description of the project
0815 - 0845    Briefing on the day's activities
0845 - 1130    Generating performance examples
1130 - 1230    Lunch
1230 - 1430    Generating more performance examples
1430 - 1530    Discussion of performance categories
               emerging in the workshop
1530 - 1615    Generating more performance examples
1615 - 1630    Review of the day's activities and
               discussion of the next steps

While writing performance incidents, participants were encouraged to avoid
activities or behaviors that reflect general soldier effectiveness (e.g.,
following rules and regulations, military appearance); such requirements
have been identified and described in a separate part of the project. (See
Borman, Motowidlo, Rose & Hanser, 1984; and Borman & Rose, 1986 for a
complete description of the Army-wide rating scales designed to assess
general soldier effectiveness.)

As  indicated  earlier,  the  objective  of  these  workshops  was  to  generate 
examples  of  effective,  average,  and  ineffective  performance  in  each  of  the 
target  MOS.  To  ensure  thorough  coverage  of  each  MOS,  workshop  leaders 
established  goals  for  participants.  Participants  were  informed  early  in 
the  day  that  each  was  expected  to  generate  about  14  to  16  incidents;  for 
the  entire  group,  we  requested  about  200  performance  incidents.  (This  goal 
applied  to  groups  with  12  to  16  participants;  it  was  modified  accordingly 
for  smaller  groups.)  To  many  participants  that  goal  seemed  unreasonably 
high,  but  as  each  workshop  session  progressed,  it  became  clear  that  all 
participants  could  (and  usually  did)  meet  the  established  goals. 

As participants finished writing an incident, workshop leaders reviewed it
to ensure that it clearly described the situation, the behavior or
activity, and the outcome of the incident. They also identified terminology
and Army acronyms that were unclear or obscure and asked participants to
clarify them.

Participants continued to generate performance incidents until it was time
to break for lunch. Following lunch, workshop leaders asked participants
to resume writing incidents for about two more hours. At that time,
performance incident writing was halted and workshop leaders began
generating discussion among participants to identify the major components
or activities comprising the job or MOS.

During  this  discussion,  participants  were  asked  to  identify  the  major  job 
performance  categories.  Workshop  leaders  recorded  suggested  categories  on 
a  blackboard  or  flipchart.  When  participants  indicated  that  all  possible 
performance  categories  had  been  identified,  the  leader  asked  them  to  review 
the  list  and  consider  whether  or  not  all  job  duties  did  indeed  appear.  The 
leader  also  asked  them  to  consider  whether  each  category  represented  first- 
term  enlistee  job  requirements  or  requirements  of  more  experienced 
soldiers. 

Following this discussion, participants were asked to review the
performance incidents they had written and to assign each one to one of the
job categories or dimensions that appeared on the blackboard or flipchart.
The workshop leader then tallied the total number of incidents in each
category. Those categories with very few incidents were the focus of the
remainder of the workshop; participants were asked to spend the remaining
time generating performance incidents for those categories represented by
only a few performance incidents.
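
The tallying step described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the category labels,
the sample assignments, and the cutoff for "very few" incidents (`minimum`)
are all hypothetical, since the report gives no numeric threshold.

```python
from collections import Counter


def tally_by_category(assignments, categories):
    """Count incidents per category, keeping categories with zero incidents."""
    counts = Counter(assignments)
    return {category: counts.get(category, 0) for category in categories}


def underrepresented(tally, minimum):
    """Return categories with fewer than `minimum` incidents, fewest first."""
    sparse = [category for category, n in tally.items() if n < minimum]
    return sorted(sparse, key=lambda category: tally[category])


# Hypothetical category labels and assignments, for illustration only.
categories = ["Category A", "Category B", "Category C"]
assignments = ["Category A"] * 40 + ["Category B"] * 3

tally = tally_by_category(assignments, categories)
# `minimum=10` is an assumed cutoff, not a value taken from the report.
focus = underrepresented(tally, minimum=10)
print(focus)
```

Categories returned by `underrepresented` would be the ones participants
concentrate on for the rest of the session.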

At the end of the session, workshop leaders discussed the next steps in the
project. We informed participants that in a few months they would be asked
to participate in another part of the study, which would involve
retranslating the performance incidents collected from all NCOs in the same
MOS.

The  plan  for  this  portion  of  the  rating  scale  development  strategy  involved 
mailing  the  retranslation  exercise  to  all  participants.  (This  strategy  was 
used  only  for  Batch  A  MOS;  for  Batch  B  a  slightly  different  approach  was 
used.)  Details  about  the  retranslation  exercise  are  provided  later  in  this 
chapter. 

Results  from  the  performance  incident  workshops  are  reported  in  Table  5  for 
Batch  A  MOS  and  in  Table  6  for  Batch  B  MOS.  In  these  tables,  we  report  the 
number  of  workshop  participants  and  number  of  performance  incidents  gen¬ 
erated  by  MOS  and  by  location,  as  well  as  the  mean  number  of  incidents 
generated  by  MOS  and  location.  The  tables  also  show  the  total  number  of 
participants  and  total  number  of  incidents  by  MOS  and  by  location. 

For Batch A, the total number of participants for each MOS ranged from 63
for Administrative Specialist (71L) to 88 for Cannon Crewman (13B). The
number of incidents generated within each MOS ranged from 989 for
Administrative Specialist (71L) to 1183 for Military Police (95B).
Finally, the average number of performance incidents provided by
participants within MOS ranged from 13.2 for Cannon Crewman (13B) to 15.7
for Administrative Specialist (71L).
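
As a quick arithmetic check, the mean number of incidents per participant
can be recomputed from the MOS totals reported in Table 5. The sketch below
is illustrative only; the function name is an assumption, and the means are
rounded to one decimal place to match the table.

```python
# Totals by MOS as reported in Table 5 (Batch A).
participants = {"13B": 88, "64C": 81, "71L": 63, "95B": 86}
incidents = {"13B": 1159, "64C": 1147, "71L": 989, "95B": 1183}


def mean_per_participant(n_incidents, n_participants):
    """Mean incidents per participant, rounded to one decimal place."""
    return round(n_incidents / n_participants, 1)


means = {
    mos: mean_per_participant(incidents[mos], participants[mos])
    for mos in participants
}
overall = mean_per_participant(
    sum(incidents.values()), sum(participants.values())
)

print(means)
print(overall)
```

The recomputed values agree with the reported range of 13.2 (13B) to 15.7
(71L) and with the overall Batch A mean of 14.1.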

For Batch B, the total number of participants within MOS ranged from 60 for
Radio Teletype Operator (31C) to 83 for Infantryman (11B). The total
number of incidents generated for each MOS ranged from 761 for Medical
Specialist (91A) to 993 for Infantryman (11B). (The total number of
incidents generated within an MOS was less for Batch B MOS than for Batch A
MOS, due to modifications in the procedures used for the Batch B
retranslation exercise. These modifications are described in the
Retranslation section of this chapter.) The average number of incidents
generated by each participant within an MOS ranged from 10.7 for Medical
Specialist (91A) to 13.0 for Radio Teletype Operator (31C).

These  data  indicate  that  we  were  successful  in  obtaining  the  number  of 
participants  requested,  and  that  participants  in  each  MOS  provided  an  ample 
number  of  performance  incidents  for  developing  behaviorally  anchored  rating 
scales  reflecting  MOS-specific  job  requirements. 

Activities  Between  Workshop  Sessions.  Performance  incident  workshops  for 
each  batch  were  conducted  over  a  period  of  three  months.  This  schedule 
permitted  the  research  staff  to  edit  and  review  performance  incidents 
between  data  collection  activities.  Thus,  for  Batch  A  MOS,  staff  members 
edited  incidents  collected  at  Fort  Ord  and  Fort  Polk  before  collecting  more 
incidents  at  Fort  Bragg  and  Fort  Campbell.  Also  during  this  time,  staff 
members  reviewed  the  incidents  and  the  performance  categories  generated  in 
the  group  discussion  to  construct  a  preliminary  performance  dimension 
system. 

These performance dimensions were then presented and discussed at Fort
Bragg and Fort Campbell. Following the data collection activities at these
posts, the process was again repeated. That is, performance incidents were
edited, content analyzed, and sorted into categories. These categories
were then integrated with those generated during the discussion with
workshop participants. And, once again, the new performance dimension
categories were presented and discussed with participants in workshops held
at Fort Hood and Fort Carson.


Table 5

Performance Incident Workshops: Number of Participants and
Number of Incidents Generated by MOS and by Location - Batch A

                                       MOS
                                                        Total By
Location                   13B     64C     71L     95B  Location

Fort Ord
  N - Participants          14      10       5      14        43
  N - Incidents            195      80      59     213       547
  Mean Per Participant    13.9     8.0    11.8    15.2      12.7

Fort Polk
  N - Participants          12      15      15      15        57
  N - Incidents            150     240     210     235       835
  Mean Per Participant    12.5    16.0    14.0    15.7      14.7

Fort Bragg
  N - Participants          13      14      11      17        55
  N - Incidents            235     221     218     225       899
  Mean Per Participant    18.1    15.8    19.8    13.2      16.4

Fort Campbell
  N - Participants          17      14       9      15        55
  N - Incidents            195     191     154     238       778
  Mean Per Participant    11.5    13.6    17.1    15.9      14.2

Fort Hood
  N - Participants          13      13      10      11        47
  N - Incidents            180     183     133      92       588
  Mean Per Participant    13.9    14.1    13.3     8.4      10.7

Fort Carson
  N - Participants          19      15      13      14        61
  N - Incidents            204     232     215     180       831
  Mean Per Participant    10.7    15.5    16.5    12.9      13.6

Totals By MOS
  N - Participants          88      81      63      86       318
  N - Incidents           1159    1147     989    1183      4478
  Mean Per Participant    13.2    14.2    15.7    13.8      14.1


Table 6

Performance Incident Workshops: Number of Participants and
Number of Incidents Generated by MOS and by Location - Batch B

                                           MOS
                                                                Total by
Location                   11B     19E     31C     63B     91A  Location

Fort Lewis
  N - Participants          16      11       8      10      11        56
  N - Incidents            211     180     124     172     130       817
  Mean Per Participant    13.8    16.4    15.5    17.2    11.8      14.6

Fort Stewart
  N - Participants          14      15      15      16      16        76
  N - Incidents            216     275     256     208     249      1204
  Mean Per Participant    15.4    18.3    17.1    13.0    15.6      15.8

Fort Riley
  N - Participants          18       7      10      11       8        54
  N - Incidents            216     123     127     133      90       689
  Mean Per Participant    12.0    17.6    12.7    12.1    11.3      13.8

Fort Bragg
  N - Participants          13      14      16      15      13        71
  N - Incidents            231     190     220     250     217      1108
  Mean Per Participant    17.8    13.6    13.8    16.7    16.7      15.6

Fort Sill*
  N - Participants           8       4       3       9      10        34
  N - Incidents             26       0      13      32      20        91
  Mean Per Participant     3.3      --     4.3     3.6     2.0       2.7

Fort Bliss*
  N - Participants          14      14       8      14      13        63
  N - Incidents             93      70      39      71      55       328
  Mean Per Participant     6.6     5.0     4.9     5.1     4.2       5.2

Total by MOS
  N - Participants          83      65      60      75      71       354
  N - Incidents            993     838     779     866     761      4237
  Mean Per Participant    12.0    12.9    13.0    11.6    10.7      12.0

*Participants spent most of the time completing retranslation booklets
rather than generating critical incidents.

A similar iterative procedure was used to generate Batch B performance
dimensions. Performance incidents collected at Fort Lewis, Fort Stewart,
and Fort Riley were edited, content analyzed, and then sorted into
performance dimensions. Results from the sort were presented and discussed
at the next site, Fort Bragg. The procedures followed for the final two
forts for Batch B, Fort Sill and Fort Bliss, differed slightly from those
used for Batch A MOS; these procedural differences are discussed in the
next section.


Retranslation  Activities 

Rationale.  A  primary  purpose  of  the  retranslation  exercise  is  to  verify 
that  the  performance  dimension  system  represents  thorough  and  comprehensive 
coverage  of  the  critical  job  requirements.  Persons  familiar  with  the 
target  job  are  asked  to  review  the  performance  incidents  generated  for  that 
job. 

After reviewing each incident, participants must first assign it to one of
the performance dimensions. The objective here is to identify performance
incidents with high levels of agreement (e.g., 50% or greater) in
performance dimension assignment.

A  second  objective  is  to  construct  performance  anchors  for  each  dimension. 
This  information  is  obtained  from  a  second  rating  participants  provide  for 
each  incident,  which  involves  evaluating  the  effectiveness  of  the  behavior 
described.  These  ratings  are  used  to  help  define  each  performance  dimen¬ 
sion  and  to  construct  behavioral  anchors  that  describe  typical  performance 
at  different  effectiveness  levels  within  that  dimension.  Such  anchors  are 
designed  to  ensure  that  raters  use  the  same  standards  of  performance  to 
evaluate  ratees.  That  is,  they  provide  raters  with  systematic  information 
about  behaviors  that  comprise  ineffective  performance,  average  performance, 
and  effective  performance  within  a  particular  dimension. 

Performance  dimension  anchors  are  derived  directly  from  performance  in¬ 
cidents.  To  construct  anchors,  performance  incidents  that  all  or  most 
raters  agree  describe  activity  in  a  single  performance  dimension  are  iden¬ 
tified  along  with  incidents  that  most  raters  agree  depict  performance  at  a 
particular  effectiveness  level.  Those  incidents  are  then  used  to  develop 
the  anchor  for  performance  at  that  effectiveness  level.  In  summary,  we  are 
looking  for  high  agreement  among  raters  on  performance  dimension  assignment 
of  incidents  (or  high  percentage  agreement)  and  high  agreement  among  raters 
for  the  effectiveness  level  demonstrated  in  each  incident  (or  low  standard 
deviations). 

Retranslation  procedures  employed  for  Batch  A  MOS  differed  from  those  for 
Batch  B  MOS.  Below  we  describe  the  activities  in  retranslating  the  perfor¬ 
mance  incidents  for  Batch  A  MOS.  We  then  discuss  some  of  the  problems  in 
using  these  procedures  and  the  modifications  made  for  Batch  B  MOS  re¬ 
translation  activities. 


Retranslation Materials and Procedures - Batch A. The Smith and Kendall
(1963) procedure calls for individuals familiar with the target job to
participate in the retranslation process. For the Batch A MOS, we planned
to include workshop participants in this phase of the project. (Recall
that these persons were supervisors of the target incumbents and, hence, as
a rule, did not include the incumbent group.) During the performance
incident workshops, participants were informed that we would contact them
by mail to complete another phase of the project.

In the last performance incident workshop, conducted at Fort Carson,
participants for each MOS were given a "practice" retranslation package
which included instructions for completing the exercise, a list and
description of performance dimensions, and a subset of the edited
performance incidents. The number of incidents retranslated varied by MOS;
13B examined 240 incidents, 64C 14 incidents, 71L up to 200 incidents, and
95B 100 incidents.

This  "practice"  retranslation  exercise  was  conducted  to  ensure  that  the 
instructions  and  completed  example  incidents  clearly  explained  the  task. 
Workshop  leaders  simply  passed  out  the  materials  to  participants  and  in¬ 
structed  them  to  complete  the  task;  no  further  instructions  were  provided. 

As  participants  finished,  leaders  noted  any  questions  or  problems  that  they 
had  experienced.  This  information  was  used  to  modify  the  retranslation 
instructions  and  the  example  items.  The  final  sets  of  retranslation  ma¬ 
terials,  including  instructions,  examples,  and  performance  dimensions  and 
definitions,  are  provided  in  Section  2  of  the  MOS  appendices. 

In  designing  the  retranslation  exercise  booklets,  we  first  screened  all 
performance  incidents  and  removed  duplicates,  incidents  that  were  unclear 
or  incomplete,  and  any  that  depicted  Army-wide  rather  than  MOS-specific  job 
requirements. 

After  taking  a  count  of  the  remaining  incidents,  we  concluded  that  it  was 
impractical  to  ask  participants  to  rate  all  performance  incidents  generated 
for  their  MOS.  As  shown  in  Tables  5  and  6,  the  number  of  incidents  gen¬ 
erated  for  each  MOS  ranged  from  761  to  1183  (the  actual  number  of  perfor¬ 
mance  incidents  was  somewhat  lower  than  that  due  to  the  screening  proce¬ 
dures  employed).  Instead,  we  constructed  a  less  onerous  task  that  asked 
participants  to  retranslate  only  a  subset  of  the  total  number;  they  were 
asked,  on  the  average,  to  retranslate  about  200  performance  incidents. 

Thus,  for  each  MOS  we  constructed  four  or  five  booklets  containing  unique 
performance  incidents  for  the  retranslation  exercise. 
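
One way to picture the booklet construction is as a partition of the
screened incident pool into non-overlapping subsets of about 200 incidents
each. The sketch below is a hypothetical illustration, not the procedure
the project staff used: the function name, the round-robin assignment, and
the pool of 950 placeholder incident IDs are all assumptions.

```python
import math


def build_booklets(incidents, target_size=200):
    """Split incidents into booklets of roughly equal size with no overlap."""
    n_booklets = max(1, math.ceil(len(incidents) / target_size))
    # Deal incidents round-robin so booklet sizes differ by at most one.
    return [incidents[i::n_booklets] for i in range(n_booklets)]


# Hypothetical pool of 950 screened incident IDs, for illustration only.
pool = list(range(950))
booklets = build_booklets(pool)

print(len(booklets))
print([len(booklet) for booklet in booklets])
```

Because every incident lands in exactly one booklet, each rater sees only a
subset while the full pool is still covered across booklets.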

Return rates across all Batch A MOS indicated that, on the average, only
about 20 percent of the participants completed the retranslation task.
This number proved insufficient for the analyses we planned. To increase
the number of retranslation ratings, we conducted retranslation workshops
at Fort Meade, Maryland. These workshops included NCOs from the four MOS
who were familiar with first-term enlistee job requirements. Project staff
members from HumRRO who were familiar with the job requirements of one or
more MOS also completed retranslation booklets.

Procedures for Batch B. Because of the low return rate from mailing out
retranslation materials for Batch A, we modified the procedures for
obtaining retranslation ratings for the Batch B MOS. Non-commissioned
officers from six locations were asked to participate in the Batch B
performance incident workshops. The first four workshops were conducted in
the same manner as those for Batch A MOS; participants spent a majority of
their time generating incidents, with an hour or two spent discussing the
critical performance categories comprising the job. At the final two
workshops, conducted at Fort Sill and Fort Bliss, participants spent the
first two hours generating performance incidents describing MOS-specific
job behaviors, then spent the remainder of their day completing
retranslation booklets.

Retranslation  materials  administered  in  these  sessions  were  very  similar  to 
those  administered  to  Batch  A  participants.  That  is,  for  each  MOS  we 
constructed  retranslation  booklets  that  contained  about  200  to  270  perfor¬ 
mance  incidents.  Thus,  retranslation  materials  for  each  Batch  B  MOS  in¬ 
cluded  from  two  to  three  booklets  that  contained  unique  performance  in¬ 
cidents.  (Retranslation  materials  administered  to  Batch  B  MOS  appear  in 
Section  2  of  the  separate  appendices.) 

During  the  final  two  workshop  sessions,  we  asked  participants  to  complete 
as  many  retranslation  booklets  as  possible.  In  general,  participants 
completed  about  one-and-one-half  to  two  booklets.  Also  during  this  ses¬ 
sion,  participants  were  asked  to  retranslate  the  performance  incidents 
generated  earlier  during  that  session.  Hence,  we  obtained  retranslation 
ratings  for  all  performance  incidents  generated  at  the  first  four  workshops 
and  for  the  new  incidents  generated  at  that  particular  workshop. 

Results  from  Retranslation  Ratings 


Table 7 summarizes the number of ratings obtained from the retranslation
exercise for Batch A and Batch B. This table indicates again that we
obtained a greater number of incidents for Batch A MOS than for Batch B
MOS. The average number of ratings per retranslation booklet varied for
the nine MOS, ranging from 7.6 for Military Police (95B) to 19.0 for
Infantryman (11B). In general, we obtained about nine or ten ratings for
each performance incident contained in the retranslation exercise.

As  noted  above,  individuals  completing  the  retranslation  exercise  were 
asked  to  read  each  performance  incident  and  provide  two  ratings:  (1)  assign 
the  incident  to  a  performance  dimension  based  on  the  behavior  depicted  in 
the  incident,  and  (2)  rate  the  effectiveness  of  the  behavior  using  a  scale 
of  1  for  ineffective  performance  to  9  for  effective  performance  (a  value  of 
5  on  this  scale  represents  average  performance). 

Analysis of the retranslation data was conducted separately for each MOS.
This included computing for each incident: (1) the number of raters; (2)
percent agreement among raters in assigning incidents to performance
dimensions; (3) mean effectiveness rating; and (4) standard deviation of
the effectiveness ratings. Percent agreement values, mean effectiveness
ratings, and standard deviations are provided for all performance incidents
included in the retranslation exercise in Section 3 of the MOS appendices.
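
The four per-incident statistics listed above can be sketched as follows.
This is a minimal illustration under stated assumptions: percent agreement
is taken to be the share of raters who chose the most frequently chosen
dimension, the sample standard deviation is used (the report does not say
which form was computed), and the function name and example ratings are
hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_incident(ratings):
    """Summarize one incident from (dimension, effectiveness) pairs."""
    dimensions = [dimension for dimension, _ in ratings]
    scores = [score for _, score in ratings]
    modal_dimension, modal_count = Counter(dimensions).most_common(1)[0]
    return {
        "n_raters": len(ratings),
        "modal_dimension": modal_dimension,
        # Assumed definition: percent of raters choosing the modal dimension.
        "percent_agreement": 100.0 * modal_count / len(ratings),
        "mean_effectiveness": mean(scores),
        # Sample SD is an assumption; a single rating yields 0.0.
        "sd_effectiveness": stdev(scores) if len(scores) > 1 else 0.0,
    }


# Hypothetical ratings from four raters on the 1-to-9 effectiveness scale.
example = [("A", 7), ("A", 8), ("A", 6), ("B", 7)]
summary = summarize_incident(example)
print(summary)
```

Here three of four raters chose dimension A, so agreement is 75 percent,
and the effectiveness ratings average 7.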

Table 7

Retranslation Exercise: Number of Forms Developed
for Each MOS and Average Number of Raters Completing Each Form

                              Average Number of
                              Incidents/Form
             Number of        (Total Number of       Average Number
             Forms            Incidents)             of Raters/Form

Batch A
                              171 (684)
                              191 (955)
                              190 (760)
                              229 (1145)

Batch B
                              274 (548)
                              201 (603)
                              235 (705)
                              230 (690)
                              210 (630)

Development of Behaviorally Anchored Rating Scales


The  next  step  in  the  process  involved  identifying  those  performance  in¬ 
cidents  in  which  raters  agreed  fairly  well  on  performance  dimension  as¬ 
signment  and  effectiveness  level.  For  each  MOS,  we  identified  performance 
incidents  that  met  the  following  criteria:  (1)  at  least  50%  of  the  raters 
agreed  that  the  incident  depicted  performance  in  a  single  performance 
dimension;  and  (2)  the  standard  deviation  of  the  mean  effectiveness  rating 
did  not  exceed  2.0. 
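
The two retention criteria can be expressed as a simple filter. The sketch
below is illustrative only; the function name, the record layout, and the
three example incidents are hypothetical, while the thresholds (at least
50% agreement, standard deviation no greater than 2.0) come from the text.

```python
def is_reliably_retranslated(percent_agreement, sd_effectiveness,
                             min_agreement=50.0, max_sd=2.0):
    """Apply both criteria: agreement >= 50% and SD of ratings <= 2.0."""
    return percent_agreement >= min_agreement and sd_effectiveness <= max_sd


# Hypothetical per-incident summaries, for illustration only.
summaries = [
    {"id": 1, "percent_agreement": 80.0, "sd_effectiveness": 1.1},
    {"id": 2, "percent_agreement": 45.0, "sd_effectiveness": 0.9},
    {"id": 3, "percent_agreement": 50.0, "sd_effectiveness": 2.4},
]

retained = [
    s["id"] for s in summaries
    if is_reliably_retranslated(s["percent_agreement"], s["sd_effectiveness"])
]
print(retained)
```

Incident 2 fails on agreement and incident 3 fails on spread, so only
incident 1 would go forward to anchor development.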

We  then  sorted  these  incidents  into  their  assigned  performance  dimensions. 
Results  from  this  sorting  are  presented  for  each  MOS  in  Tables  8  through  16 
and  are  discussed  in  detail  in  the  next  section  of  this  chapter.  The 
performance  dimensions  listed  in  these  tables  were  the  ones  used  by  raters 
in  the  retranslation  exercise;  they  do  not  necessarily  reflect  the  perfor¬ 
mance  dimensions  administered  in  the  field  test  sessions  described  in 
Chapter  2. 

After all incidents had been sorted into performance dimensions, we
examined the incidents and the percentage agreement values in each
dimension. Recall that previously we had identified all performance
incidents for which at least 50% of the raters agreed in dimension
assignment. We carefully reviewed those incidents with percentage
agreement at the 50% level to identify performance dimensions that raters
found confusing or difficult to distinguish one from another. For example,
most raters for the Armor Crewman (19E) MOS agreed that incidents
describing tank hull or tank turret system maintenance should be assigned
to either "Maintaining tank/hull suspension system and associated
equipment" (Dimension A) or "Maintaining tank turret/fire control system"
(Dimension B) (see Table 13). It appeared that tank maintenance activities
could not be clearly distinguished by tank component, so these two
performance dimensions were combined into one.

After  evaluating  our  performance  dimension  systems  and  modifying  them  using 
results  from  the  retranslation  exercise,  we  began  developing  behavioral 
anchors  for  each  dimension.  This  involved  sorting  performance  incidents 
into  three  effectiveness-level  categories--effective  performance  with  mean 
values  of  6.5  or  higher,  average  performance  with  mean  values  of  3.5  to 
6.4,  and  ineffective  performance  with  mean  values  of  1.0  to  3.4.  We  re¬ 
viewed  the  content  of  the  incidents  in  each  of  these  three  areas  and  then 
summarized  the  information  in  each  to  form  three  behavioral  anchors  depict¬ 
ing  effective,  average,  and  ineffective  performance. 
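
The three-way sort by mean effectiveness rating can be sketched as below.
This is an illustration under one stated assumption: the report gives the
bands to one decimal place (1.0 to 3.4, 3.5 to 6.4, 6.5 or higher), so the
sketch treats 3.5 and 6.5 as lower bounds of the average and effective
bands. The function names and example means are hypothetical.

```python
from collections import defaultdict


def effectiveness_level(mean_rating):
    """Map a mean rating on the 1-to-9 scale to one of three levels."""
    if not 1.0 <= mean_rating <= 9.0:
        raise ValueError("mean rating must lie on the 1-to-9 scale")
    if mean_rating >= 6.5:
        return "effective"
    if mean_rating >= 3.5:  # assumed lower bound of the average band
        return "average"
    return "ineffective"


def group_by_level(mean_ratings):
    """Group incident IDs by effectiveness level."""
    groups = defaultdict(list)
    for incident_id, mean_rating in mean_ratings.items():
        groups[effectiveness_level(mean_rating)].append(incident_id)
    return dict(groups)


# Hypothetical mean ratings keyed by incident ID, for illustration only.
groups = group_by_level({101: 7.8, 102: 5.0, 103: 2.1, 104: 6.5})
print(groups)
```

Each group would then be reviewed and summarized into the behavioral anchor
for its level.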

It  is  important  to  note  that  for  each  MOS  we  developed  Behavioral  Summary 
Scales.  Traditional  behaviorally  anchored  rating  scales  contain  specific 
examples  of  job  behaviors  for  each  effectiveness  level  in  a  performance 
dimension.  Behavioral  Summary  Scales,  on  the  other  hand,  contain  anchors 
that  represent  the  behavioral  content  of  all  performance  incidents  reliably 
retranslated  for  that  particular  level  of  effectiveness.  This  makes  it 
more  likely  that  a  rater  using  the  scales  will  be  able  to  match  observed 
performance  with  performance  on  the  rating  scale  (Borman,  1979).  A  sample 
of  one  behavioral  summary  scale  constructed  for  one  MOS,  Military  Police 
(95B),  is  presented  in  Figure  2. 

A.  TRAFFIC CONTROL AND ENFORCEMENT

Controlling traffic and enforcing traffic laws and parking rules.


1  2

•  Often uses hand/arm signals that are difficult to understand, at times
   resulting in unnecessary accidents; often fails to wear reflectorized
   gear; overlooks hazardous traffic conditions; sleeps on duty; pays
   excessive attention to things unrelated to the job.

•  May display excess leniency or harshness when citing offenders,
   allowing their military rank, race, and/or sex to influence his/her
   actions; makes many errors when filling out citations.


3  4  5

•  Usually does a reasonable job when directing traffic by using adequate
   hand/arm signals and/or wearing reflectorized gear.

•  Makes few errors when filling out citations; usually does not allow an
   offender's race, sex, and/or military rank to interfere with good
   judgment.


6  7

•  Consistently uses appropriate hand/arm signals; always wears
   reflectorized gear; generally monitors traffic from plain-view vantage
   points; consistently refrains from behaviors such as reading and
   prolonged conversation on non-job related topics.

•  Always uses emergency equipment (e.g., flares, barricades) to highlight
   unsafe conditions and ensures that hazards are removed or otherwise
   taken care of.


Figure 2.  Sample Behavioral Summary Rating Scale for Military Police (95B)

It is evident from Tables 8 through 16 that some performance dimensions
contained a small number of reliably sorted incidents. When this occurred,
we reconsidered including that performance dimension in the rating scales.
For some MOS, these dimensions were omitted or, where appropriate,
combined with another performance dimension. To combine these dimensions
with other dimensions, we examined the percentage agreement values to
determine whether or not raters confused the dimension in question with
another performance dimension. In some cases, we retained the performance
dimension because it represented requirements that, although performed
infrequently, are critical for success on the job. Behavioral anchors for
such dimensions were developed by extrapolating information from available
performance incidents.

After developing the performance rating scales for each MOS, we submitted
the scales for review, generally by a PDRI research staff member familiar
with the development process. Results from this review were used to
clarify performance definitions and behavioral anchors. The final set of
performance rating scales administered in field test sessions is included
in Section 4 of the MOS appendices.

Results  and  Revisions 

Below  we  describe  results  from  the  retranslation  data  for  each  MOS  and  the 
modifications  made  to  the  scales. 

Cannon Crewman (13B). For the retranslation exercise, 10 performance
dimensions were identified from the performance incidents collected.
Results from the retranslation exercise indicate that the number of incidents
reliably sorted into these dimensions ranged from 14 to 195 (see Table 8).
Most incidents appeared for "Driving and maintaining vehicles, Howitzers,
and equipment" (Dimension B) and "Transporting/sorting/storing and
preparing ammunition for fire" (Dimension C). Although only a small number of
incidents were reliably sorted into "Receiving and relaying communications"
(Dimension H) and "Position improvement" (Dimension J), these dimensions
were retained because they represent important activities in the Cannon
Crewman MOS.

The final set of rating scales contains all of the ten original performance
dimensions. They appear as follows: A. Loading out equipment; B. Driving
and maintaining vehicles, Howitzers, and equipment; C. Transporting/sorting/
storing and preparing ammunition for fire; D. Preparing for occupation/
emplacing Howitzer; E. Setting up communications; F. Gunnery; G.
Loading/unloading Howitzer; H. Receiving and relaying communications; I.
Recording/record keeping; and J. Position improvement. (See Appendix A,
Section 4 for complete scale definitions and anchors.)

Table 8

Cannon Crewman (13B): Number of Behavioral Examples
Reliably Retranslated Into Each Dimension^a

                                                              Number of
Dimension                                                      Examples

A. Loading out equipment                                             49
B. Driving and maintaining vehicles, Howitzers, and equipment       195
C. Transporting/sorting/storing and preparing ammunition for fire   108
D. Preparing for occupation and emplacing Howitzer                   44
E. Setting up communications                                         24
F. Gunnery                                                           99
G. Loading/unloading Howitzer                                        32
H. Receiving and relaying communications                             19
I. Recording/record keeping                                          29
J. Position improvement                                              14

Total Number                                                        613

^a Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations
of their effectiveness ratings of less than 2.0.

Motor Transport Operator (64C). A sorting of the performance incidents
revealed that 10 dimensions described the job requirements for this MOS.
The number of incidents reliably sorted into each dimension ranged from 15
to 181 (see Table 9). Dimensions containing the largest number of reliably
sorted incidents include "Checking and maintaining vehicles" (Dimension C)
and "Driving vehicles" (Dimension A). Although one dimension, "Performing
dispatcher duties" (Dimension J), contains a small number of incidents,
this was retained because it represents an important requirement of the
Motor Transport Operator position.

Table 9

Motor Transport Operator (64C): Number of Behavioral Examples
Reliably Retranslated Into Each Dimension^a

                                                              Number of
Dimension                                                      Examples

A. Driving vehicles
B. Vehicle coupling
C. Checking and maintaining vehicles
D. Using maps/following proper routes
E. Loading cargo and transporting personnel
F. Parking and securing vehicles
G. Performing administrative duties
H. Self-recovering vehicles
I. Safety-mindedness
J. Performing dispatcher duties

Total Number

^a Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations
of their effectiveness ratings of less than 2.0.
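
The retention rule stated in the table footnotes (more than 50% of raters sorting an example into one dimension, and a standard deviation of effectiveness ratings below 2.0) can be written as a small Python check. This is a sketch, not the procedure the authors ran: the function name, the sample data, and the use of the sample standard deviation are assumptions made for illustration.

```python
from collections import Counter
from statistics import stdev


def is_reliably_retranslated(dimension_sorts, effectiveness_ratings):
    """Apply the two retention rules from the table footnotes.

    dimension_sorts: the dimension letter each rater chose for the example.
    effectiveness_ratings: the effectiveness rating each rater gave it.
    Returns (retained, modal_dimension).
    """
    modal_dimension, votes = Counter(dimension_sorts).most_common(1)[0]
    agreement = 100.0 * votes / len(dimension_sorts)
    # Sample standard deviation; the report does not say which form was used.
    spread = stdev(effectiveness_ratings)
    retained = agreement > 50.0 and spread < 2.0
    return retained, modal_dimension


# Hypothetical example sorted by ten raters: 70% agreement, tight ratings.
sorts = ["B"] * 7 + ["C"] * 3
ratings = [6, 7, 6, 7, 6, 7, 6, 6, 7, 6]
print(is_reliably_retranslated(sorts, ratings))
```

An example with exactly 50% agreement fails the first rule, because the footnote requires greater than 50%.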

The final set of 10 rating scales includes: A. Driving vehicles; B. Vehicle
coupling; C. Checking and maintaining vehicles; D. Using maps/following
proper routes; E. Loading cargo and transporting personnel; F. Parking and
securing vehicles; G. Performing administrative duties; H. Self-recovering
vehicles; I. Safety-mindedness; and J. Performing dispatcher duties. (See
Appendix B, Section 4 for complete scale definitions and anchors.)

Administrative Specialist (71L). For the retranslation exercise, we
derived 13 performance dimensions from a sorting of the performance
incidents. The number of incidents reliably sorted into each ranged from 2 to
183 (see Table 10). Dimensions containing the largest number of incidents
include "Preparing, typing, and proofreading documents" (Dimension A) and
"Keeping records" (Dimension F).

We modified the performance dimension system after reviewing the
retranslation results. First, we decided to drop Dimensions I through M.
"Preparing special reports, document drafts, or other materials" (Dimension I)
was deleted because it described skills and activities more frequently
performed by only the most experienced first-termers and by second-termers.
Dimensions J through M were omitted because they involve job requirements
for a subset of incumbents within the 71L position--71L F5 or Postal Clerk.
These dimensions were identified very early in the workshop sessions and we
encouraged participants to generate behavioral examples of these
activities, when possible. It is clear from the retranslation data, however,
that very few participants generated examples describing these duties
and/or very few incidents were reliably sorted into these performance
categories. Therefore, we decided to omit these dimensions.

The final set of Administrative Specialist rating scales includes: A.
Preparing, typing, and proofreading documents; B. Distributing and
dispatching incoming and outgoing documents; C. Maintaining office resources;
D. Posting regulations; E. Establishing and/or maintaining files IAW TAFFS;
F. Keeping records; G. Safeguarding and monitoring security of classified
documents; and H. Providing customer service. (See Appendix C, Section 4
for complete scale definitions and anchors.)

Military Police (95B). A content analysis of the performance incidents
revealed that seven dimensions effectively represented the requirements for
this MOS. The number of incidents reliably sorted into these dimensions
ranged from 50 to 236 (see Table 11). Dimensions containing the largest
number of incidents are "Patrolling and crime/accident prevention
activities" (Dimension D) and "Making arrests, gathering information on criminal
activity, and reporting on crimes" (Dimension C).

We modified the performance dimensions only slightly; we shortened
dimension titles. The final set of performance dimensions appears as follows:
A. Traffic control and enforcement; B. Providing security; C. Investigating
crimes and making arrests; D. Patrolling; E. Promoting the public image of
the Military Police; F. Interpersonal communication skills; and G.
Responding to medical emergencies. (See Appendix D, Section 4 for complete scale
definitions and anchors.)

Table 10

Administrative Specialist (71L): Number of Behavioral Examples
Reliably Retranslated Into Each Dimension^a

                                                              Number of
Dimension                                                      Examples

A. Preparing, typing, and proofreading documents                    183
B. Distributing and dispatching incoming/outgoing documents          63
C. Maintaining office resources                                      73
D. Posting regulations                                               44
E. Establishing and/or maintaining files IAW TAFFS                   50
F. Keeping records                                                   94
G. Safeguarding and monitoring security of classified documents      43
H. Providing customer service                                        30
I. Preparing special reports, document drafts, or other materials    19
J. Sorting, routing and distributing incoming/outgoing mail          28
K. Maintaining Army Post Office equipment                             2
L. Keeping Post Office records                                       20
M. Maintaining security of mail                                       9

Total Number                                                        658

^a Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations
of their effectiveness ratings of less than 2.0.
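
As a quick arithmetic check on Table 10, the sketch below totals the counts and separates the dimensions retained for the Administrative Specialist scales (A through H) from those dropped (I through M). The counts are copied from Table 10; the variable names are illustrative only.

```python
# Counts copied from Table 10 (Administrative Specialist, 71L).
table_10 = {
    "A": 183, "B": 63, "C": 73, "D": 44, "E": 50, "F": 94, "G": 43,
    "H": 30, "I": 19, "J": 28, "K": 2, "L": 20, "M": 9,
}

# Dimensions I through M were dropped from the final rating scales.
dropped = {"I", "J", "K", "L", "M"}

total = sum(table_10.values())
retained_total = sum(n for dim, n in table_10.items() if dim not in dropped)
dropped_total = sum(n for dim, n in table_10.items() if dim in dropped)

print(total)           # 658, matching the table's Total Number
print(retained_total)  # 580 examples across the eight retained dimensions
print(dropped_total)   # 78 examples across the five dropped dimensions
```

The five dropped dimensions account for 78 of the 658 reliably retranslated examples.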


Table 11

Military Police (95B): Number of Behavioral Examples
Reliably Retranslated Into Each Dimension^a

                                                              Number of
Dimension                                                      Examples

A. Traffic control and enforcement on post and in the field          63
B. Providing escort security and physical security                  128
C. Making arrests, gathering information on criminal activity,
   and reporting on crimes                                          173
D. Patrolling and crime/accident prevention activities              236
E. Promoting confidence in the military police by maintaining
   personal and legal standards and through community service work  118
F. Using interpersonal communication (IPC) skills                    87
G. Responding to medical emergencies and other emergencies of
   a non-criminal nature                                             50

Total Number                                                        855

^a Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations
of their effectiveness ratings of less than 2.0.


Infantryman (11B). For the retranslation exercise, 13 performance
dimensions were identified through a content analysis of the performance
incidents. Results from this exercise revealed that raters reliably sorted
from 5 to 91 incidents into each performance dimension (see Table 12). The
greatest numbers of incidents were reliably sorted into "Demonstrating
proficiency in the use of all weapons, armaments, equipment, and supplies"
(Dimension E) and "Performing guard and security duties" (Dimension K).

An examination of the percent agreement values indicated that raters
frequently confused "Using weapons safely" (Dimension D) and "Demonstrating
proficiency in the use of all weapons, armaments, equipment, and supplies"
(Dimension E). Therefore, we decided to combine these two to form a single
dimension, "Use of weapons and other equipment."

We decided to retain one of the dimensions that contained only a few
performance incidents, "Demonstrating courage and proficiency in engaging
the enemy" (Dimension L), because it represented a critical Infantryman
activity.

The only modification made to the remaining performance dimensions involved
renaming them; virtually all dimensions received new titles. We labeled
the final set of 12 dimensions as follows: A. Maintaining supplies,
equipment, and weapons; B. Assisting and leading others; C. Navigation; D. Use
of weapons and other equipment; E. Field sanitation, personal hygiene, and
safety; F. Fighting position; G. Avoiding enemy detection; H. Operating a
radio; I. Reconnaissance and patrol; J. Guard and security duties; K.
Courage and proficiency in battle; and L. Prisoners of war. (See Appendix
E, Section 4 for complete scale definitions and anchors.)

Armor Crewman (19E). A content analysis of the performance incidents
revealed that 11 performance dimensions described the major components of
the Armor Crewman job (see Table 13). Retranslation raters reliably sorted
from 11 to 123 incidents into each dimension. The largest numbers of
incidents appeared in "Maintaining tank, hull/suspension system and
associated equipment" (Dimension A) and "Driving/recovering tanks"
(Dimension C).

We modified the performance dimension system using results from the
retranslation exercise. First, agreement values for Dimensions A and B
indicated that raters frequently confused these two. Therefore, we decided
to combine the two to form a single dimension, "Maintaining tank, tank
systems, and associated equipment." For similar reasons "Establishing
security in the field" (Dimension I) and "Preparing/securing tanks"
(Dimension K) were combined to form a single dimension, "Preparing tanks for
field problems." Finally, we decided to omit "Navigating" (Dimension J),
because it contained only a few incidents and because this dimension
appeared to represent job responsibilities required of more experienced or
higher ranking soldiers.

The final set of rating scales contains 8 performance dimensions. These
include: A. Maintaining tank, tank systems and associated equipment; B.
Driving/recovering tanks; C. Stowing ammunition aboard tanks; D.
Loading/unloading guns; E. Maintaining guns; F. Engaging targets with tank
guns; G. Operating and maintaining communications equipment; and H.
Preparing tanks for field problems. (See Appendix F, Section 4 for complete
scale definitions and anchors.)

Table 12

Infantryman (11B): Number of Behavioral Examples
Reliably Retranslated Into Each Dimension^a

                                                              Number of
Dimension                                                      Examples

A. Ensuring that all supplies and equipment are field-ready
   and available and well-maintained in the field                    73
B. Providing leadership and/or taking charge in combat situations    33
C. Navigating and surviving in the field                             53
D. Using weapons safely                                              38
E. Demonstrating proficiency in the use of all weapons, armaments,
   equipment, and supplies                                           91
F. Maintaining sanitary conditions, personal hygiene, and
   personal safety in the field                                      24
G. Preparing a fighting position                                     29
H. Avoiding enemy detection during movement and in established
   defensive positions                                               22
I. Operating a radio                                                 27
J. Performing reconnaissance and patrol activities                   37
K. Performing guard and security duties                              75
L. Demonstrating courage and proficiency in engaging the enemy        5
M. Guarding and processing POWs and enemy casualties                 15

Total Number                                                        522

^a Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations
of their effectiveness ratings of less than 2.0.

Table 13

Armor Crewman (19E): Number of Behavioral Examples
Reliably Retranslated Into Each Dimension^a

                                                              Number of
Dimension                                                      Examples

A. Maintaining tank hull/suspension system and
   associated equipment                                             123
B. Maintaining tank turret system/fire control system                37
C. Driving/recovering tanks                                          80
D. Stowing and handling ammunition                                   39
E. Loading/unloading guns                                            30
F. Maintaining guns                                                  43
G. Engaging targets with tank guns                                   45
H. Operating and maintaining communication equipment                 36
I. Establishing security in the field                                33
J. Navigating                                                        11
K. Preparing/securing tank                                           27

Total Number                                                        504

^a Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations
of their effectiveness ratings of less than 2.0.

Radio Teletype Operator (31C). Initially, we identified seven performance
dimensions to represent the job requirements for this MOS. Results from
the retranslation exercise indicate that raters reliably sorted from 33 to
162 incidents into each dimension (see Table 14). The greatest numbers of
incidents appeared in "Installing and preparing equipment for operation"
(Dimension C) and "Operating communications devices and providing for an
accurate and timely flow of information" (Dimension D).

We made one change in the performance dimension system. Results from the
retranslation exercise indicated that raters frequently confused two of the
dimensions, "Inspecting equipment and troubleshooting problems" (Dimension
A) and "Pulling preventative maintenance and servicing equipment"
(Dimension B). Hence, we combined these two into a single dimension, "Inspecting
and servicing equipment." In addition, we renamed some of the performance
dimensions.

The final set of rating scales contains the following six performance
dimensions: A. Inspecting and servicing equipment; B. Installing and
repairing equipment; C. Operating communications devices; D. Preparing
reports; E. Maintaining security; and F. Providing safe transportation. (See
Appendix G, Section 4 for complete scale definitions and anchors.)

Light-Wheel Vehicle Mechanic (63B). For the retranslation exercise, we
identified 11 performance dimensions that represent the important
requirements of the mechanic position. Retranslation raters reliably sorted from
15 to 101 incidents into each dimension, with the greatest numbers
appearing in "Repair" (Dimension D), and "Safety-mindedness" (Dimension K) (see
Table 15).

Performance rating scales developed for the field test included all 11
original dimensions. We reasoned that although "Vehicle and equipment
operation" (Dimension G) and "Planning/organizing jobs" (Dimension I)
contained a small number of incidents, these activities represented important
components of the mechanic position. The only modification made to the
scales involved reordering the final four dimensions. The final set of
performance dimensions appears as follows: A. Inspecting and testing
problems with equipment; B. Troubleshooting; C. Performing routine maintenance;
D. Repair; E. Using tools and test equipment; F. Using technical documents;
G. Vehicle and equipment operation; H. Safety mindedness; I. Administrative
duties; J. Planning and organizing jobs; and K. Recovery. (See Appendix H,
Section 4 for complete scale definitions and anchors.)

Table 14

Radio Teletype Operator (31C): Number of Behavioral Examples
Reliably Retranslated Into Each Dimension^a

                                                              Number of
Dimension                                                      Examples

A. Inspecting equipment and troubleshooting problems
B. Pulling preventative maintenance and servicing equipment
C. Installing and preparing equipment for operation
D. Operating communications devices and providing for an
   accurate and timely flow of information
E. Preparing reports
F. Maintaining security of equipment and information
G. Locating and providing safe transport of equipment to sites

Total Number

^a Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations
of their effectiveness ratings of less than 2.0.

Table 15

Light-Wheel Vehicle Mechanic (63B): Number of Behavioral Examples
Reliably Retranslated Into Each Dimension^a

                                                              Number of
Dimension                                                      Examples

A. Inspecting, testing, and detecting problems with equipment        47
B. Troubleshooting                                                   63
C. Performing routine maintenance                                    23
D. Repair                                                           101
E. Using tools and test equipment                                    68
F. Using technical documentation                                     56
G. Vehicle and equipment operation                                   18
H. Recovery                                                          36
I. Planning/organizing jobs                                          15
J. Administrative duties                                             41
K. Safety mindedness                                                 89

Total Number                                                        557

^a Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations
of their effectiveness ratings of less than 2.0.

Medical Specialist (91A). The original system contained 11 performance
dimensions. The number of incidents reliably sorted into each dimension
ranged from 11 to 142 (see Table 16). The greatest numbers of incidents
appeared in "Responding to emergency situations" (Dimension J), and
"Providing routine and ongoing patient care" (Dimension I).

Modifications for the field test included deleting two performance
dimensions. We omitted one dimension, "Attending to patients’ concerns"
(Dimension D), because this particular activity appeared important for success in
many of the performance dimensions. A second dimension, "Providing
accurate diagnoses in a clinic, hospital, or field setting" (Dimension E),
was omitted because it represented duties required of more experienced or
higher ranking soldiers.

The final set of rating scales contains nine performance dimensions. These
include: A. Maintaining and operating Army medical vehicles and equipment;
B. Maintaining accountability of medical supplies and equipment; C. Keeping
medical records; D. Arranging transportation and/or transporting injured
personnel; E. Dispensing medications; F. Preparing and inspecting field
site or clinic facilities; G. Providing routine and ongoing patient care;
H. Responding to emergency situations; and I. Providing health care and
health maintenance instruction to Army personnel. (See Appendix I, Section
4 for complete scale definitions and anchors.)

Table 16

Medical Specialist (91A): Number of Behavioral Examples
Reliably Retranslated Into Each Dimension^a

                                                              Number of
Dimension                                                      Examples

A. Maintaining and operating Army vehicles
B. Maintaining accountability of medical supplies and equipment
C. Keeping medical records
D. Attending to patients’ concerns
E. Providing accurate diagnoses in a clinic, hospital,
   or field setting
F. Arranging for transportation and/or transporting injured
   personnel
G. Dispensing medications
H. Preparing and inspecting field site or clinic facilities
   in the field
I. Providing routine and ongoing patient care
J. Responding to emergency situations
K. Providing instruction to Army personnel

Total Number

^a Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations
of their effectiveness ratings of less than 2.0.


Preparation  for  Field  Test 

In  sum,  we  relied  on  results  from  the  retranslation  exercise  to  evaluate 
and  modify  the  performance  dimension  system  for  each  MOS.  Further,  we 
generated  behavioral  anchors  for  each  of  the  performance  dimensions  using 
results  from  our  analysis  of  the  retranslation  ratings. 

The  final  set  of  behaviorally  anchored  rating  scales  for  the  nine  MOS,  as 
described  in  the  preceding  section,  contains  from  6  to  12  performance 
dimensions.  Each  of  the  performance  dimensions  includes  behavioral  anchors 
describing  ineffective,  average,  and  effective  performance.  Raters  are 
asked  to  use  these  anchors  to  evaluate  ratees  on  a  seven-point  rating  scale 
ranging  from  1  (ineffective  performance)  to  7  (effective  performance). 
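
One way to picture a single scale is as a small record that carries the dimension title, a definition, three behavioral anchors, and the seven-point range. The Python sketch below is hypothetical: the class name, field names, and placeholder anchor text are assumptions, and the actual scale wording is in the MOS appendices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatingScale:
    """One behaviorally anchored rating scale (illustrative structure only)."""
    title: str
    definition: str
    anchors: dict  # keys: "ineffective", "average", "effective"
    low: int = 1   # 1 = ineffective performance
    high: int = 7  # 7 = effective performance

    def validate(self, rating: int) -> int:
        """Return the rating if it lies on the seven-point scale, else raise."""
        if not (self.low <= rating <= self.high):
            raise ValueError(f"rating must be between {self.low} and {self.high}")
        return rating


# Placeholder text; the real definition and anchors are not reproduced here.
scale = RatingScale(
    title="Traffic control and enforcement",
    definition="(definition text from the rating booklet)",
    anchors={
        "ineffective": "(anchor text)",
        "average": "(anchor text)",
        "effective": "(anchor text)",
    },
)
print(scale.validate(4))
```

The sketch deliberately does not tie each anchor to specific scale points, since that mapping is not stated in this section.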

Before administering the rating scales in the field test, we constructed
one additional rating scale for each MOS rating booklet. This scale asks
raters to evaluate an incumbent’s overall performance across all MOS-
specific performance dimensions. This final rating scale is virtually the
same for all MOS; it includes three anchors depicting ineffective, average,
and effective performance.

Finally, we constructed rating scale booklets for each MOS that provided
raters with performance dimension titles, definitions, and behavioral
anchors. We designed rating booklets such that raters could evaluate up to
five ratees in each. The booklets themselves do not include instructions
for using the scales to make performance ratings. Our plan was to provide
oral instructions during the field test rating sessions.

The MOS-specific rating scale booklets ask raters to evaluate incumbents on
several performance dimensions specific to the target MOS job requirements
and then to consider the incumbents’ performance across all MOS-specific
performance dimensions to arrive at an overall evaluation.


CHAPTER 2: MOS-SPECIFIC BEHAVIORALLY ANCHORED RATING SCALES:
FIELD TEST ADMINISTRATION AND RESULTS


Introduction 

Field test sessions were conducted separately for Batch A and Batch B MOS.
We administered rating scales to Batch A MOS during the period of May
through August 1984. These sessions were conducted at three CONUS sites
and at two OCONUS (Outside Continental United States) sites. These
included Fort Hood, Texas; Fort Polk, Louisiana; Fort Riley, Kansas; and two
USAREUR sites (U.S. military posts located in West Germany).

Rating scales for Batch B MOS were field tested during the period of
February through April 1985. Sessions were conducted at four CONUS
locations and several OCONUS locations. These included Fort Lewis, Washington;
Fort Polk, Louisiana; Fort Riley, Kansas; Fort Stewart, Georgia; and
USAREUR locations in West Germany.

Administration  procedures  for  the  rating  sessions  were  virtually  the  same 
for  the  two  batches.  Before  describing  those  procedures,  we  describe  the 
field  test  set-up  to  provide  the  context  in  which  the  rating  scales  were 
administered. 

At each field test site, project staff administered several job performance
and training performance measures to first-term enlistees. These measures
were divided into four blocks: (1) hands-on tests of critical job tasks;
(2) written job knowledge tests of critical tasks; (3) rating scales
measuring performance in critical task areas both Army-wide and MOS-specific,
and performance on broad behavioral dimensions both Army-wide and MOS-
specific; and (4) written tests assessing knowledge acquired in Advanced
Individual Training (AIT). The objective was to evaluate all training and
performance measures that had been developed for Project A. Each type of
measure was administered in a four-hour period. Thus, first-term enlistees
participating in the field test sessions were scheduled to appear for two
consecutive days.

The general plan for administering the four types of performance measures
included scheduling 60 recruits from a particular MOS for the two-day
period. This group was then divided into four smaller groups of fifteen.
Over the two-day period we rotated the four groups into the four job
performance/training outcome assessment blocks. For example, Group A began
by completing the hands-on test and then attended the rating session on Day
One; on Day Two, Group A attended the written job knowledge test session
in the morning and the written training knowledge test in the afternoon.
Group B began with the written training knowledge test and the hands-on
test on Day One; on Day Two, this group attended the rating session in the
morning and completed the written job knowledge test in the afternoon.
Group C began with the written job knowledge test and then the written
training knowledge test on Day One; Day Two activities included the
hands-on test and then the rating session. Finally, Group D began with the
written training knowledge test and then attended the rating session; on
Day Two, this group completed the job knowledge and hands-on tests. Figure 3
contains a sample schedule for one MOS at one test site location, USAREUR-
Batch B.
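
The rotation just described can be written down as data and checked for coverage. The Python sketch below is an illustration only: the block labels and function names are made up, and reading each group's sequence as first and second session on Day One followed by first and second session on Day Two is an assumption based on the order in which the text lists the sessions.

```python
BLOCKS = ("hands-on", "ratings", "job knowledge", "training knowledge")

# Block order per group as described in the text: Day One first session,
# Day One second session, Day Two first session, Day Two second session.
SCHEDULE = {
    "A": ["hands-on", "ratings", "job knowledge", "training knowledge"],
    "B": ["training knowledge", "hands-on", "ratings", "job knowledge"],
    "C": ["job knowledge", "training knowledge", "hands-on", "ratings"],
    "D": ["training knowledge", "ratings", "job knowledge", "hands-on"],
}


def group_sizes(total_soldiers, n_groups=4):
    """Split a scheduled cohort into equally sized rotation groups."""
    if total_soldiers % n_groups:
        raise ValueError("cohort does not divide evenly into groups")
    return [total_soldiers // n_groups] * n_groups


def missing_blocks(schedule):
    """Return {group: blocks not scheduled}; empty sets mean full coverage."""
    return {group: set(BLOCKS) - set(seq) for group, seq in schedule.items()}


print(group_sizes(60))           # four groups of fifteen
print(missing_blocks(SCHEDULE))  # every group completes all four blocks
```

With 60 soldiers the split gives four groups of fifteen, and every group's sequence covers each of the four assessment blocks once.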


The  procedure  described  above  was  modified  to  accommodate  soldiers  from  two 
MOS  attending  the  field  test  session  over  the  same  two  day  period.  In  this 
case,  we  scheduled  30  soldiers  from  each  MOS  and  again  divided  them  into 
four  groups  of  fifteen.  The  four  groups  completed  the  four  performance 
measurement  sessions  on  a  rotational  schedule.  Figure  4  provides  a  sample 
schedule  for  a  field  test  session  that  includes  two  different  MOS  for  the 
same  two-day  period. 

Our objective for all performance assessment sessions was to have
administrators work closely with participants to ensure that everyone
understood the instructions and to uncover any problems with the materials and
the procedures. Specifically, for the rating sessions, we wanted to
uncover any problems with the scales (e.g., whether raters understand the
instructions for completing the rating scales, whether raters understand
the performance dimensions and are able to use each to evaluate ratees’
performance, what type of rater training is useful in this setting).

In the next section, we describe each sample participating in the field
test sessions (by MOS), and then describe the procedures used to administer
the rating scales. To present the context in which the MOS-specific
behaviorally anchored rating scales (BARS) were administered, we describe the
materials included in each rating session, and the rater training
procedures. Our focus throughout this report is, however, on the MOS-specific
BARS, so in the results and discussion section, we deal exclusively with
those scales. (Campbell et al., 1986, document development activities and
field test results for hands-on measures and written job knowledge
measures. Davis, Davis & Joyner, 1985, document development activities and
field test results for job relevant training measures.)


Method 


Sample 


Before  scheduling  the  field  test  sites,  we  constructed  a  roster  of  possible 
first-term  enlistees  for  each  MOS.  This  roster  was  generated  by  identi¬ 
fying  soldiers  whose  enlistment  date  fell  between  1  April  1982  and  30  June 
1983.  This  period  was  selected  so  that  soldiers  participating  in  the  field 
tests  would  have  from  fifteen  months  up  to  three  years  of  experience  on  the 
job.  For  each  field  test  site,  we  generated  a  list  of  soldiers  for  each 
MOS  whose  entry  date  fell  within  this  period.  (This  information  was  ob¬ 
tained  from  the  World  Wide  Personnel  Locator  Service  compiled  by  the  U.S. 
Army.)  This  list  was  given  to  the  point-of-contact  (POC)  at  each  field 
test  site,  who  was  then  responsible  for  contacting  the  appropriate  units 
and  obtaining  the  designated  number  of  soldiers  from  the  target  MOS  on  the 
scheduled days.

Our goal for Batch A MOS was to include about 150 soldiers from each MOS in
the  field  test  sessions.  For  Batch  B,  we  attempted  to  include  about  180 
soldiers  from  each  MOS. 

Table  17  and  Table  18  provide  descriptive  information  for  Batch  A  and  Batch 
B  MOS  soldiers  participating  in  the  field  test  sessions.  A  breakdown  of 
each  MOS  sample  by  location,  gender,  race,  pay  grade,  and  age  is  provided. 

Across  the  nine  MOS,  note  that  for  gender,  three  MOS  samples  contain  no 
females. Recall that 13B, 11B, and 19E are combat arms MOS, and therefore
females  are  not  included.  Two  MOS,  71L  and  91A,  contain  a  fairly  high 
percentage  of  females  (50.0%  and  37.7%  respectively).  The  remaining  MOS 
samples  contain  a  much  smaller  proportion  of  females  (64C--7.1%;  95B--2.6%; 
31C--12.8%;  and  63B--6.5%). 

The  method  for  obtaining  information  about  soldiers’  race  or  ethnic  group 
varied  from  Batch  A  to  Batch  B.  As  is  evident  from  Tables  17  and  18, 
participants  from  Batch  A  MOS  were  asked  to  indicate  race  by  checking  (1) 
white,  (2)  black,  (3)  Asian,  (4)  American  Indian,  or  (5)  other.  On 
Table  17,  we  combined  the  numbers  for  Asian  and  American  Indian  with  the 
"other"  category  because  there  were  so  few  in  those  categories.  For  the 
Batch  B  field  test,  we  revised  the  category  system.  Participants  were 
asked  to  indicate  race  or  ethnic  group  membership  using  the  following 
categories: (1) white; (2) black; (3) Hispanic; and (4) other.

Across  the  nine  MOS,  the  racial  membership  of  our  sample  varies  greatly. 

The  percentage  of  whites  within  each  MOS  ranges  from  50.0  to  91.2  percent. 
For  blacks,  the  percentage  ranges  from  5.3  to  42.0  percent.  For  the 
"other"  category,  the  percentages  range  from  0.7  to  7.3  percent.  Across 
the  five  MOS  in  Batch  B,  the  percentage  of  Hispanics  ranges  from  2.0  to  4.1 
percent. 

Mean  age  values  for  Batch  A  MOS  samples  range  from  21.4  to  22.4  with  a 
median  value  of  21  for  three  MOS  and  22  for  one  MOS.  The  modal  age  is  20. 
For  Batch  B  samples  the  mean  age  ranges  from  22.3  to  23.1,  with  a  median 
value  of  22  for  all  five  MOS.  The  modal  age  for  these  MOS  is  21.  Since 
the  Batch  B  field  test  sessions  were  conducted  six  months  after  the  Batch  A 
sessions,  we  would  expect  Batch  B  MOS  samples  to  be  slightly  older  than 
Batch  A  MOS  samples. 

Across  the  nine  MOS,  the  majority  of  participants  indicated  that  their  pay 
grade  at  the  time  of  testing  was  either  E-3  or  E-4.  The  percentage  of 
soldiers  in  the  E-3  and  E-4  pay  grades  ranges  from  86.1  percent  for  Mili¬ 
tary  Police  (95B)  to  95.5  percent  for  Motor  Transport  Operator  (64C).  A 
smaller percentage reported pay grades of E-1 or E-2; in only one MOS,
Military  Police  (95B),  does  the  total  percentage  for  these  pay  grades 
exceed  10%.  Finally,  a  much  smaller  percentage  of  soldiers  reported  pay 
grades  of  E-5  (2.5%  for  Armor  Crewman  and  1.4%  for  Radio  Teletype 
Operator) . 

The  final  variable,  location,  indicates  the  number  of  soldiers  parti¬ 
cipating  at  each  field  test  site.  In  Batch  A,  soldiers  in  the  Cannon 
Crewman (13B) and the Motor Transport Operator (64C) positions were
obtained exclusively from OCONUS (USAREUR) locations. Administrative
Specialist (71L) and Military Police (95B) samples were tested exclusively in
CONUS  locations.  Batch  B  MOS  samples  were  obtained  from  both  CONUS  and 
OCONUS  locations. 


Preparation  for  Rating  Sessions 

Our  plan  for  administering  performance  ratings  included  obtaining  evalua¬ 
tions  from  first-term  enlistees’  colleagues  or  peers  and  from  enlistees’ 
supervisors.  Procedures  for  identifying  an  enlistee’s  peers  and  super¬ 
visors  are  described  below. 

Identifying  Peers.  On  Day  One  of  the  field  test  session,  we  convened  the 
entire  group  of  60  first-term  enlistees  to  describe  the  purpose  of  Project 
A, the activities they would be involved in over the two-day period, and
how  those  activities  meshed  with  the  goals  of  Project  A. 

Also  at  this  time,  the  soldiers  were  given  an  alphabetized  list  of  recruits 
from  their  MOS  who  were  participating  in  the  field  test  session.  They  were 
asked  to  review  the  list  and  to  identify  as  many  soldiers  as  they  could 
whom  they  had  worked  with  or  knew  well  enough  to  rate  in  several  job 
performance  areas.  We  defined  a  work  colleague  or  peer  as:  (1)  someone 
they  had  known  for  at  least  two  months,  and  (2)  someone  they  had  observed 
performing  on  the  job  on  several  occasions. 

Soldiers  were  first  asked  to  find  their  own  name  on  the  list  and  circle  it. 
Next,  they  were  asked  to  identify  the  soldiers  that  they  knew  by  placing  a 
check  next  to  each  soldier’s  name.  We  asked  them  to  check  off  as  many 
names  on  the  list  as  they  could,  but  we  also  informed  them  that  they  would 
only  be  asked  to  rate,  at  most,  four  of  their  peers,  regardless  of  the 
number  they  reported  knowing. 

We  used  the  information  on  these  lists  to  make  peer  rating  assignments. 

For  the  most  part,  peer  assignments  were  made  via  computer.  A  computer 
program  was  developed  to  randomly  assign  ratees  to  raters  using  the  infor¬ 
mation  soldiers  gave  us  about  individuals  with  whom  they  had  worked  on  the 
job.  To  operate  this  program,  we  first  input  the  information  from  each 
soldier’s  list  indicating  all  enlistees  he/she  reported  knowing  well  enough 
to  evaluate.  The  computer  program  used  this  information  to  assign  ratees 
to  raters.  The  output,  all  things  being  equal,  assigned  each  rater  four 
ratees  or  soldiers  and  assigned  each  ratee  or  soldier  to  four  raters.  The 
goal  was  to  obtain  four  peer  ratings  for  each  soldier  participating  in  the 
field  test  session. 
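
The report does not document the assignment program itself. As an illustration only, the following Python sketch shows one way a random assignment of this kind could be made. The function name `assign_peer_ratees`, the data layout, and the shuffle-then-fill strategy are assumptions made for this example, not the project's actual algorithm.

```python
import random


def assign_peer_ratees(known, max_per_rater=4, max_per_ratee=4, seed=None):
    """Hypothetical sketch: randomly assign ratees to raters.

    `known` maps each rater to the set of soldiers that rater reported
    knowing well enough to evaluate. Returns rater -> list of ratees.
    """
    rng = random.Random(seed)
    assignments = {rater: [] for rater in known}
    times_rated = {}  # ratee -> number of raters assigned so far

    # List every admissible (rater, ratee) pair, then shuffle the list.
    pairs = [(rater, ratee)
             for rater in sorted(known)
             for ratee in sorted(known[rater])
             if ratee != rater]
    rng.shuffle(pairs)

    # Accept pairs in random order until either cap is reached.
    for rater, ratee in pairs:
        if len(assignments[rater]) >= max_per_rater:
            continue
        if times_rated.get(ratee, 0) >= max_per_ratee:
            continue
        assignments[rater].append(ratee)
        times_rated[ratee] = times_rated.get(ratee, 0) + 1
    return assignments


if __name__ == "__main__":
    # Six hypothetical soldiers who all report knowing one another.
    soldiers = "ABCDEF"
    known = {s: {t for t in soldiers if t != s} for s in soldiers}
    for rater, ratees in assign_peer_ratees(known, seed=1).items():
        print(rater, "rates", ratees)
```

A simple fill of this kind respects both caps but does not guarantee that every soldier ends up with exactly four raters, which is consistent with the "all things being equal" qualification above.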

This  procedure  required  about  one-and-one-half  hours  to  complete.  After 
the  computer  generated  the  rating  assignments,  we  recorded  the  names  of  the 
ratees  on  a  rating  tab  along  with  the  name  of  the  rater.  Because  so  much 
time  was  required  to  perform  these  rating  assignments,  no  rating  sessions 
were  conducted  during  the  morning  session  of  Day  One. 

Identifying  Supervisors.  First-term  enlistees’  supervisors  were  identified 
by  the  POC  or  other  military  personnel  located  at  each  site  or  post.  Our 
goal was to obtain at least two supervisory ratings for each enlistee
attending the field test sessions. We asked units from which the first-
term  enlistees  were  selected  to  identify  the  NCO  directly  responsible  for 
supervising  each  enlistee  as  well  as  the  NCO  or  officer  serving  as  the 
second-line  supervisor  for  each  enlistee. 

Thus,  when  we  tested  60  soldiers  from  an  MOS  at  a  particular  post,  it  was 
possible  to  have  as  many  as  120  supervisors  scheduled  to  evaluate  their 
performance.  In  most  cases,  however,  supervisors  were  able  to  rate  several 
soldiers.  Supervisor  rating  sessions  were  conducted  with  groups  of  varying 
sizes,  ranging  from  as  few  as  five  to  as  many  as  30  supervisors. 


Procedures  for  Administering  Rating  Scales 

Procedures  followed  for  the  peer  rating  sessions  and  for  the  supervisor 
rating  sessions  were  virtually  identical.  During  each  session,  partici¬ 
pants  were  asked  to  evaluate  ratees  on  Army-wide  tasks  or  tasks  common  to 
all MOS, Army-wide behaviorally anchored rating scales (BARS) representing
broad  performance  requirements  that  cut  across  all  MOS,  MOS-specific  task 
scales,  and  MOS-specific  BARS.  Participants  were  also  asked  to  complete 
two  questionnaires  designed  to  obtain  information  about  their  job  history 
and  current  job  situation.  (Documentation  of  Army-wide  rating  scale  de¬ 
velopment  activities  has  been  prepared  by  Pulakos  &  Borman,  1986.  Campbell 
et  al.,  1986,  have  documented  information  for  the  MOS-specific  task  rating 
scales.  Olson  &  Borman,  1986,  document  the  development  and  results  for  the 
Army  environment  questionnaire.)  Below  we  describe  the  general  procedures 
for  administering  these  rating  scales. 

Rating  Session.  Administrators  began  each  rating  session  with  a  brief 
review  of  Project  A  and  a  description  of  the  activities  involved  in  the 
rating  session.  Participants  were  again  reminded  that  the  information  they 
provided  would  remain  strictly  confidential  and  would  not  appear  in  their 
permanent  record,  nor  would  anyone  in  the  Army  ever  be  informed  of  how  they 
had  rated  their  peers  or  how  their  peers  had  evaluated  them.  Supervisors 
were  informed  that  their  subordinates  would  never  see  the  ratings  they 
provided  and  that  the  ratings  would  not  appear  in  the  enlistees’  permanent 
files. 

Next,  we  gave  each  participant  a  rating  tab  listing  the  peers  or  sub¬ 
ordinates  they  would  be  rating.  We  asked  them  to  review  the  list  to  make 
sure  that  they  felt  confident  rating  the  job  performance  of  all  persons  on 
their  list.  Participants  were  reminded  that  we  wanted  them  to  only  rate 
soldiers whom they: (1) had known for at least two months and (2) had
observed  performing  on  the  job.  Administrators  consulted  with  each  parti¬ 
cipant  who  reported  problems  and  resolved  these  by  finding  a  replacement 
ratee  or  by  simply  deleting  a  ratee  if  no  replacements  were  available. 

Administrators  then  distributed  the  first  rating  scale  booklet.  Before 
participants  began  making  their  ratings,  administrators  provided  guidance 
and  instruction  about  evaluating  job  performance. 

Rater  Training.  Administrators  began  this  part  of  the  rating  session  by 
describing the steps followed in developing the rating scales. They
informed participants that the behaviorally anchored rating scales had been
developed with the help of NCOs familiar with the job or MOS in question.
That  is,  the  performance  dimensions  and  anchors  had  been  defined  by  indi¬ 
viduals  most  familiar  with  MOS  job  requirements.  Next,  administrators 
explained  how  to  use  the  information  provided  in  the  booklets  to  make  their 
ratings.  This  included  a  discussion  of  the  behavioral  anchors  and  an 
example  of  how  a  rater  should  use  these  anchors  to  evaluate  ratees’  perfor¬ 
mance. 


Finally,  administrators  discussed  four  common  rating  errors  and  ways  to 
avoid  them  when  providing  performance  ratings.  These  errors  included:  (1) 
halo  error,  or  failing  to  consider  a  person’s  strengths  and  weaknesses 
independently  for  each  performance  dimension;  (2)  single-time  error,  or 
basing  one’s  ratings  for  a  person  on  a  single  event,  failing  to  consider 
performance  on  several  occasions;  (3)  stereotype  error,  or  providing  per¬ 
formance  ratings  based  on  appearance,  background,  or  other  characteristics 
unrelated to job performance; and (4) same-level-of-effectiveness error, or
failing  to  distinguish  between  two  or  more  ratees  on  a  single  performance 
dimension. 


During  this  discussion,  administrators  defined  each  type  of  error  and 
provided  a  relevant  example  of  how  it  might  occur.  They  emphasized  that 
participants  should  rely  on  their  observations  of  each  ratee  and  avoid 
considering  other  unrelated  factors.  Participants  were  encouraged  to  ask 
questions  about  rating  procedures  and  to  obtain  clarification  on  how  to 
avoid  the  common  rating  errors. 


At  the  end  of  this  discussion,  administrators  explained  the  procedures  for 
recording  ratings  in  the  booklets  and  indicated  that  they  would  review  the 
ratings  as  participants  progressed  through  the  booklet  answering  any  ques¬ 
tions  and  dealing  with  any  problems  that  might  arise. 


We had three objectives for the rater training session. First, we wanted
to  ensure  that  all  participants  understood  the  instructions  and  knew  how  to 
record  their  ratings  in  the  booklet.  Second,  we  wanted  to  make  sure  that 
participants  understood  the  rationale  behind  the  behaviorally  anchored 
rating  scales,  so  that  all  raters  would  be  using  the  same  "frame  of  refer¬ 
ence"  or  standards  to  evaluate  ratees’  performance.  And  third,  we  wanted 
to  ensure  that  raters  understood  the  importance  of  reading  performance 
dimension definitions and anchors, and carefully considering the job
performance behaviors they had observed, BEFORE evaluating ratees’
performance.


We  explored  the  effects  of  different  types  of  training  during  the  field 
test  sessions.  Information  about  the  different  types  of  rater  training 
programs and their impact on peer and supervisor ratings is presented in
Pulakos  and  Borman  (1986)  and  Pulakos  (1986). 


Administering  the  Remaining  Scales.  For  the  other  rating  scales  included 
in  the  workshops,  administrators  followed  essentially  the  same  procedures. 
They  described  how  the  scales  had  been  developed  and  the  procedures  for 
recording  ratings  on  the  form  or  in  the  booklet  provided.  Further,  raters 
were  reminded  that  they  should  try  to  avoid  making  the  common  rating  er¬ 
rors,  and  that  because  the  ratings  were  for  research  purposes  only,  they 
should be as candid as possible in making their ratings.

Computing Rating Scores. Ratings collected during the field test sessions
were  pooled  across  locations  for  each  MOS.  For  example,  ratings  collected 
for  the  Armor  Crewman  position  at  the  five  test  sites--Fort  Lewis,  Fort 
Polk,  Fort  Riley,  Fort  Stewart,  and  USAREUR--were  combined  and  analyzed  as 
a  single  unit. 

One  apparent  problem  with  the  ratings  surfaced  when  we  compared  mean  rat¬ 
ings  for  a  single  ratee  provided  by  two  or  more  raters.  Although  raters 
appeared to agree on a particular ratee’s strengths and weaknesses across
the  different  performance  dimensions,  level  differences  in  mean  ratings 
appeared.  Because  we  were  more  interested  in  an  enlistee’s  profile  of 
ratings  across  the  different  performance  dimensions  (i.e.,  a  ratee’s  rela¬ 
tive  strengths  and  weaknesses),  we  decided  to  compute  adjusted  scores  that 
would  reduce  or  eliminate  the  level  differences  between  scores  provided  by 
two  or  more  raters  for  a  single  ratee. 

An  examination  of  the  ratings  provided  by  each  rater  revealed  that  some 
raters  had  failed  to  provide  ratings  for  all  enlistees  on  each  performance 
dimension.  Therefore,  it  was  necessary  to  compute  adjusted  scores  by 
comparing  raters’  evaluations  on  a  single  performance  dimension  rather  than 
across  all  performance  dimensions.  Below  we  describe  the  procedures  de¬ 
veloped  to  compute  adjusted  ratings  or  scores;  we  include  an  example  for 
one  rater  and  one  performance  dimension  to  demonstrate  how  these  adjust¬ 
ments  were  made. 

•  For  each  rater,  we  identified  the  score  provided  for  one  enlistee 
on  a  single  performance  dimension.  For  example,  Rater  1  gave 
Enlistee  A  a  score  of  4.0  and  Enlistee  B  a  score  of  5.0  on 
Dimension  X. 

• We identified all other peer and supervisor raters providing
evaluations for the same enlistees on that same performance
dimension as the target rater. For each enlistee, we computed
the mean rating across these other raters. In our example, Raters 2, 3,
and 4 evaluated Enlistee A on Dimension X; we computed the mean
rating for Enlistee A across these three raters, for a mean of
5.3. Only two raters, Raters 3 and 4, evaluated Enlistee B on
Dimension X; we calculated the mean rating for Raters 3 and 4 for
Enlistee B, for a mean of 5.5.

• We then compared the score for the target rater-enlistee pair
with the mean computed for the same enlistee across all other
raters. These values were used to compute a mean difference
score for the target rater-enlistee pair. Continuing with our
example, Rater 1 gave Enlistee A a rating of 4.0 while the other
three raters evaluating Enlistee A provided a mean rating of 5.3.
Thus Rater 1 would receive a difference score of -1.3 for
Enlistee A on Dimension X.


• This procedure was repeated to compute a difference score for
each rater-enlistee combination on each performance dimension.
Values for Enlistee B are 5.0 for Rater 1 and 5.5 for Raters 3
and 4, giving Rater 1 a mean difference score of -0.5 for
Enlistee B on Dimension X.

• For each target rater-enlistee pair, we identified a value for
weighting the difference score. In our example, Rater 1 has a
difference score of -1.3 for Enlistee A and -0.5 for Enlistee B.
We weighted each score using the number of other raters evaluating
each enlistee. So, in this example the mean difference score
for Enlistee A is weighted 3 because three other raters evaluated
this enlistee. The mean difference score for Enlistee B is
weighted 2.

• For each rater, we computed a weighted average difference score
for each performance dimension. For Dimension X, Rater 1 received
a weighted average difference score of -1.0 [i.e.,
(3(-1.3) + 2(-0.5))/5].

• Finally, an average difference score was computed across all
performance dimensions for that rater. The average difference
score was then used to adjust all ratings provided by the target
rater. For Rater 1 the average across all performance dimensions
is -1.2. Therefore, all ratings provided by Rater 1 were
increased by a value of 1.2.
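
The steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the program used in the project: the function names, the dictionary layout, and the individual ratings assigned to Raters 2, 3, and 4 are hypothetical, chosen so that the means for Enlistees A and B come out near the 5.3 and 5.5 of the worked example. The sketch also assumes that the final average across performance dimensions is a simple unweighted mean.

```python
from collections import defaultdict


def rater_offsets(ratings):
    """Average difference score per rater.

    `ratings` maps (rater, enlistee, dimension) -> raw score.
    """
    # Collect all scores given to each enlistee on each dimension.
    by_target = defaultdict(dict)
    for (rater, enlistee, dim), score in ratings.items():
        by_target[(enlistee, dim)][rater] = score

    weighted_sum = defaultdict(float)  # (rater, dim) -> sum of weight * diff
    weight_total = defaultdict(int)    # (rater, dim) -> sum of weights
    for (enlistee, dim), scores in by_target.items():
        for rater, score in scores.items():
            others = [s for r, s in scores.items() if r != rater]
            if not others:
                continue  # no other rater to compare against
            diff = score - sum(others) / len(others)
            # Weight each difference by the number of other raters.
            weighted_sum[(rater, dim)] += len(others) * diff
            weight_total[(rater, dim)] += len(others)

    # Weighted average per dimension, then simple mean across dimensions.
    per_dim = defaultdict(list)
    for (rater, dim), total in weighted_sum.items():
        per_dim[rater].append(total / weight_total[(rater, dim)])
    return {rater: sum(v) / len(v) for rater, v in per_dim.items()}


def adjust(ratings):
    """Subtract each rater's average difference score from that rater's ratings."""
    offsets = rater_offsets(ratings)
    return {key: score - offsets.get(key[0], 0.0)
            for key, score in ratings.items()}


# Hypothetical raw ratings on Dimension X (values for Raters 2-4 are assumed).
example = {
    ("R1", "A", "X"): 4.0, ("R1", "B", "X"): 5.0,
    ("R2", "A", "X"): 5.0,
    ("R3", "A", "X"): 5.0, ("R3", "B", "X"): 5.0,
    ("R4", "A", "X"): 6.0, ("R4", "B", "X"): 6.0,
}

if __name__ == "__main__":
    print(round(rater_offsets(example)["R1"], 2))
    print(round(adjust(example)[("R1", "A", "X")], 2))
```

With these assumed ratings, Rater 1's weighted average difference score on Dimension X is approximately -1.0, so Rater 1's ratings are raised by about one scale point. Subtracting a negative offset is what "increased by" means in the last step of the list.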

The  above  procedures  were  used  to  compute  adjusted  scores  for  all  raters. 

Ratings  supplied  by  peers  and  supervisors  were  pooled  to  compute  adjusted 
scores. 

Screening  the  Rating  Data.  The  next  step  in  the  analyses  involved  screen¬ 
ing  the  data  to  identify  ratings  that  appeared  unrealistic  or  did  not 
correspond  to  other  ratings  provided  for  the  same  ratee.  Because  "true" 
performance  scores  were  not  available,  we  evaluated  the  data  by  comparing 
information  provided  by  one  rater  with  information  provided  by  all  other 
raters  evaluating  the  same  enlistee(s).  Two  criteria  for  identifying 
questionable  raters  were  developed. 

• First, we computed the correlation between performance dimension
ratings  for  a  target  rater-enlistee  pair  and  the  mean  performance 
dimension  ratings  provided  by  all  other  raters  evaluating  that 
enlistee.  If  this  correlation  was  -.2  or  lower  for  any  enlistee, 
all  of  the  rater’s  ratings  were  deleted  from  the  data  set. 

•  Second,  we  examined  each  rater’s  average  difference  score  used  to 
make  the  rating  score  adjustments.  Any  rater  that  obtained  an 
average  difference  score  of  2.0  or  greater  in  absolute  value  was 
deleted  from  the  sample. 
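
A minimal sketch of the two screening criteria, written in Python, is given here for illustration. The function names and argument layout are hypothetical. The sketch assumes a Pearson correlation for the profile comparison (the report does not name the coefficient) and skips any profile with zero variance, for which the correlation is undefined.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns None when either series has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def is_questionable(profile_pairs, avg_difference, r_cut=-0.2, diff_cut=2.0):
    """Apply both screening criteria to one rater.

    `profile_pairs` holds one (own_ratings, others_mean_ratings) pair per
    enlistee, each a list of scores across performance dimensions.
    `avg_difference` is the rater's average difference score.
    """
    # Criterion 2: extreme level difference from the other raters.
    if abs(avg_difference) >= diff_cut:
        return True
    # Criterion 1: profile correlation of -.2 or lower for any enlistee.
    for own, others in profile_pairs:
        r = pearson(own, others)
        if r is not None and r <= r_cut:
            return True
    return False


if __name__ == "__main__":
    reversed_profile = [([1, 2, 3, 4], [4, 3, 2, 1])]
    print(is_questionable(reversed_profile, 0.5))
```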

For  any  rater  whose  adjusted  scores  met  one  or  both  of  the  above  screening 
criteria,  all  ratings  provided  by  that  rater  were  deleted  from  the  data 
set.  Thus,  for  one  discrepant  rater,  we  may  have  eliminated  one  or  more 
ratees. This number varied according to the number of soldiers evaluated
by the discrepant rater.


Our  goal  for  eliminating  raters  was  to  be  as  conservative  as  possible  by 
deleting  only  the  most  extreme  ratings.  As  a  result,  very  few  ratees  were 
deleted  from  the  data  set.  For  each  of  the  MOS  by  rater  type  (supervisors 
or peers) data sets, the number of ratees deleted from each set ranges from zero
to  seven.  Across  all  MOS  and  rater  types,  data  were  eliminated  for  only  22 
ratees. 

Subsequent  Analyses.  For  all  remaining  analyses,  we  analyzed  ratings 
provided  by  supervisors  separately  from  ratings  provided  by  peers.  Using 
the  adjusted  scores  computed  for  each  rater,  we  computed  a  mean  performance 
dimension  score  for  each  ratee.  These  mean  values  were  used  to  compute  the 
mean,  standard  deviation,  and  range  of  scores  across  all  ratees  for  each 
performance  dimension. 

We  computed  the  intraclass  correlation  between  ratings  provided  for  the 
same  enlistees  to  estimate  the  degree  of  interrater  reliability  on  each 
performance  dimension.  Next,  intercorrelations  between  performance  dimen¬ 
sion  ratings  provided  by  peers  and  between  performance  dimension  ratings 
provided  by  supervisors  were  computed.  Intercorrelations  between  peer  and 
supervisor  ratings  were  also  computed.  We  present  and  discuss  these  data 
separately  for  each  MOS  in  the  "Results"  section. 

Differences  Between  Batch  A  and  Batch  B  Data  Sets.  Before  presenting  these 
data,  however,  we  must  call  attention  to  some  differences  between  the 
adjusted  rating  scores  computed  for  Batch  A  MOS  and  Batch  B  MOS. 

First,  recall  that  for  all  MOS,  raters  used  a  scale  of  1  (low)  to  7  (high) 
to  evaluate  ratees.  These  "raw"  ratings  were  then  adjusted  for  level 
differences  between  raters,  using  the  procedure  described  above.  This 
procedure  provided  some  adjusted  scores  that  fell  outside  the  actual  range 
of  rating  scale  values;  for  example,  the  rating  scores  for  one  performance 
dimension  ranged  from  0.49  to  7.17.  In  the  analyses  of  Batch  A  MOS  rat¬ 
ings,  we  allowed  the  adjusted  values  to  exceed  the  actual  scale  point 
range.  For  Batch  B  MOS,  we  modified  the  adjusted  scores  so  that  the  range 
of  adjusted  values  would  correspond  to  the  range  of  "raw"  values  (i.e.,  all 
scores  would  fall  within  a  range  of  1  to  7);  this  was  accomplished  by 
truncating adjusted scores that exceeded 7.0 or that fell below 1.0. In
the following tables, then, the ratings for Batch A may fall outside the
range of 1 to 7, whereas ratings for Batch B MOS fall within this range.
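
The Batch B truncation amounts to clipping each adjusted score to the scale endpoints. A short Python sketch, with a hypothetical function name, applied to the two extreme values quoted above:

```python
def truncate_to_scale(score, low=1.0, high=7.0):
    """Clip an adjusted score to the 1-to-7 rating scale (Batch B treatment)."""
    return max(low, min(high, score))


if __name__ == "__main__":
    for adjusted in (0.49, 4.20, 7.17):
        print(adjusted, "->", truncate_to_scale(adjusted))
```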

Another  difference  in  the  analyses  performed  for  the  two  batches  of  MOS 
involves  the  assumptions  made  in  computing  the  interrater  reliability 
estimates  for  peers.  Since  the  goal  was  to  obtain  four  peer  ratings  for 
each  enlistee,  in  computing  the  interrater  reliability  coefficients  for 
peer  ratings  obtained  for  Batch  A  MOS  we  assumed  four  raters  per  ratee. 
When  computing  these  values  for  Batch  B  MOS,  we  first  computed  the  average 
number  of  peer  raters  per  ratee.  This  information  led  us  to  modify  our 
assumption  about  the  average  number  of  raters,  so  for  Batch  B  MOS  inter¬ 
rater  reliability  estimates  were  computed  assuming  three  raters  per  ratee. 

Interrater  reliability  estimates  computed  for  peer  ratings  provided  for 
Batch A MOS samples can be interpreted as the expected correlation between
(1) the mean ratings provided for soldiers by their peers in this sample
and  (2)  the  mean  ratings  that  would  be  provided  for  the  same  soldiers  by  an 
equivalent  group  of  peers,  assuming  that  all  soldiers  were  rated  by  four 
peers.  "Equivalent"  indicates  any  peer  who  meets  the  two  criteria  for 
rating  a  soldier. 

Interpretation  of  interrater  reliability  estimates  computed  for  peer  rat¬ 
ings  provided  for  Batch  B  MOS  samples  is  similar  to  the  interpretation  for 
Batch  A  MOS,  except  that  we  assume  that  three  rather  than  four  peers 
provided  ratings  for  Batch  B. 

For all MOS, interrater reliability estimates computed for supervisors can
be  interpreted  as  the  expected  correlation  between  (1)  the  mean  ratings 
provided  for  soldiers  by  their  supervisors  in  this  sample  and  (2)  the  mean 
ratings  that  would  be  provided  for  the  same  soldiers  by  an  equivalent  group 
of  supervisors,  assuming  that  all  soldiers  were  rated  by  two  supervisors. 

By  "equivalent,"  we  mean  any  supervisor  who  meets  the  two  criteria  for 
rating  a  soldier. 

Assumptions  concerning  the  number  of  raters  evaluating  each  soldier  affect 
the  resulting  reliability  estimate.  The  more  raters  evaluating  a  soldier, 
generally,  the  higher  the  estimate.  For  the  field  test  data,  then,  we 
would  expect  higher  interrater  reliability  estimates  for  ratings  provided 
by  peers  than  by  supervisors,  and  higher  reliability  estimates  for  ratings 
provided  by  peers  in  Batch  A  MOS  than  by  peers  in  Batch  B  MOS. 
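
The report does not give the formula behind these estimates. One standard way to see why the assumed number of raters matters is the Spearman-Brown relation, which projects the reliability of a mean of k ratings from the reliability of a single rating. The Python sketch below uses that relation as an assumption, and the single-rater value of .25 is purely hypothetical.

```python
def spearman_brown(single_rater_r, k):
    """Reliability of the mean of k ratings, given single-rater reliability."""
    return k * single_rater_r / (1 + (k - 1) * single_rater_r)


if __name__ == "__main__":
    single = 0.25  # hypothetical single-rater reliability
    for k, label in ((2, "supervisors"),
                     (3, "peers, Batch B"),
                     (4, "peers, Batch A")):
        print(label, k, round(spearman_brown(single, k), 2))
```

For the same single-rater reliability, the projected estimate rises from two to three to four raters, which is the ordering described above.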


Results 

For each group of ratings, we calculated the ratio of the number of
raters  to  the  number  of  ratees.  These  data,  reported  in  Table  19,  are 
presented  separately  for  each  MOS  and  for  supervisor  and  peer  ratings.  For 
comparison,  we  have  included  ratios  for  rating  data  computed  before  and 
after  the  ratings  were  screened.  Note  that  these  ratios  change  very  little 
following  the  screening  process. 

For supervisors, the "after" ratios range from 1.04 for Administrative
Specialist (71L) to 1.88 for Military Police (95B) with a median value of
1.73.  These  data  indicate  that  for  a  majority  of  enlistees  in  each  MOS,  we 
obtained  ratings  from  two  supervisors.  Within  the  Administrative  Spe¬ 
cialist  MOS,  however,  we  obtained  an  average  of  only  one  supervisor  rating 
for  each  enlistee. 

For  peer  ratings,  the  "after"  ratio  of  raters  to  ratees  ranges  from  1.89 
for  Administrative  Specialist  (71L)  to  3.39  for  Military  Police  (95B)  with 
a median value of 2.57. Thus, we obtained an average of at least two peer
ratings per enlistee in every MOS with the exception of Administrative
Specialist. For enlistees in four of the MOS, Military Police (95B),
Infantryman (11B), Armor Crewman (19E), and Medical Specialists (91A), we
obtained about three peer ratings for each.
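
As a check on the summary figures quoted above, the medians can be recomputed from the "after" columns of Table 19. The Python sketch below simply re-enters those table values; the variable names are arbitrary.

```python
from statistics import median

# "After" ratios of raters to ratees, taken from Table 19.
supervisor_after = {
    "13B": 1.47, "64C": 1.82, "71L": 1.04, "95B": 1.88, "11B": 1.81,
    "19E": 1.68, "31C": 1.73, "63B": 1.77, "91A": 1.59,
}
peer_after = {
    "13B": 2.52, "64C": 2.57, "71L": 1.89, "95B": 3.39, "11B": 2.99,
    "19E": 2.95, "31C": 2.50, "63B": 2.09, "91A": 3.10,
}

supervisor_median = median(supervisor_after.values())
peer_median = median(peer_after.values())

if __name__ == "__main__":
    print("supervisors:", supervisor_median)
    print("peers:", peer_median)
```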

On  the  following  pages,  we  describe  the  results  for  each  MOS  individually. 
For  each  rater  group  (i.e.,  supervisors  and  peers),  we  report  the  range  of 
adjusted ratings, mean, and standard deviation for each performance
dimension as well as the grand mean across all performance dimensions and
ratees. For comparison, the text includes the grand mean computed across
unadjusted ratings (this value does not appear in the tables). We also
focus on the interrater reliability estimates and on the intercorrelations
between performance dimension ratings provided by peers and by supervisors.


Table 19

Ratio of Raters to Ratees Before and After Screening
for Supervisor and Peer Ratings

                                        Supervisors         Peers
MOS                                    Before  After    Before  After

13B - Cannon Crewman                    1.47    1.47     2.89    2.52
64C - Motor Transport Operator          1.84    1.82     2.77    2.57
71L - Administrative Specialist         1.04    1.04     1.90    1.89
95B - Military Police                   1.94    1.88     3.67    3.39
11B - Infantryman                       1.81    1.81     2.99    2.99
19E - Armor Crewman                     1.68    1.68     2.95    2.95
31C - Radio Teletype Operator           1.73    1.73     2.49    2.50
63B - Light-Wheel Vehicle Mechanic      1.77    1.77     2.08    2.09
91A - Medical Specialist                1.59    1.59     3.10    3.10


Cannon Crewman - 13B

We  collected  performance  information  on  a  total  of  150  first-term  enlistees 
from  the  Cannon  Crewman  MOS.  Table  20  presents  the  means,  standard  devia¬ 
tions,  range  of  scores  and  interrater  reliability  estimates  for  supervisors 
and  peers. 

Complete  supervisor  rating  data  were  collected  for  140  enlistees.  Focusing 
on  those  ratings,  adjusted  ratings  range  from  0.65  to  7.76.  Mean  adjusted 
performance  dimension  values  range  from  4.48  to  5.19  (standard  deviations 
range  from  1.03  to  1.31).  The  grand  mean,  computed  across  all  enlistees 
and  all  performance  dimensions,  using  the  adjusted  ratings,  is  4.89 
(SD=0.81).  The  unadjusted  grand  mean  value  is  4.89  (SD=1.13).  Interrater 
reliability  estimates  range  from  .33  (J.  Position  improvement)  to  .70  (K. 
Overall  performance)  with  a  median  value  of  .45. 

Ratings  provided  by  peers,  for  140  enlistees,  adjusted  for  level  dif¬ 
ferences,  range  from  0.76  to  8.87.  Mean  adjusted  ratings  across  the  11 
performance  dimensions  range  from  4.47  to  5.05  and  the  standard  deviations 
range  from  0.80  to  1.22.  The  grand  mean  value  computed  for  adjusted  scores 
is  4.85  (SD=0.71);  the  grand  mean  for  unadjusted  values  is  4.89  (SD=0.84). 
Reliability  estimates  range  from  .40  (H.  Receiving  and  relaying  communica¬ 
tions)  to  .66  (G.  Loading/unloading  Howitzer)  with  a  median  value  of  .54. 

Table  21  presents  the  intercorrelation  matrix  for  the  supervisor  and  peer 
ratings.  For  supervisors  alone,  correlations  between  the  dimension  rat¬ 
ings  (excluding  Overall  performance)  range  from  .19  to  .70  with  a  mean 
value of .46 (SD=0.12). Examination of the Overall ratings provided by
supervisors  indicates  that  "Gunnery"  (Dimension  F),  "Position  improvement" 
(Dimension  J),  and  "Loading/unloading  Howitzer"  (Dimension  G)  correlate 
highest  with  this  rating. 

Correlations  between  dimension  ratings  provided  by  peers  (excluding  Overall 
performance)  range  from  .36  to  .62  with  a  mean  of  .50  (SD=0.07).  For 
peers, "Gunnery" (Dimension F), "Recording/record keeping" (Dimension I),
and  "Position  improvement"  (Dimension  J)  correlate  highest  with  the  Overall 
rating. 

Intercorrelations  between  dimension  ratings  provided  by  supervisors  and  by 
peers  (excluding  Overall  performance)  range  from  .15  to  .53.  The  degree  of 
agreement  between  peers  and  supervisors  is  more  apparent  from  the  values  in 
the  diagonal  of  this  matrix  (e.g.,  peer  ratings  on  Dimension  A  correlated 
with  supervisor  ratings  on  Dimension  A).  Correlations  between  supervisor 
and  peer  ratings  on  the  11  performance  dimensions  range  from  .18  (E.  Set¬ 
ting  Up  Communications)  to  .54  (D.  Preparing  for  occupation/emplacing 
Howitzer)  with  a  median  value  of  .39. 
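
The diagonal values described above can be reproduced with a short
calculation. The sketch below is illustrative only: it assumes that the
ratings from each source are stored per dimension with one value per
enlistee, and the function names and the five-enlistee data are hypothetical
rather than taken from the field test.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def convergent_correlations(supervisor, peer):
    """Correlate the two rating sources dimension by dimension.

    Both arguments map a dimension label to a list of ratings, one per
    enlistee, with enlistees in the same order in both sources.
    """
    return {dim: pearson(supervisor[dim], peer[dim]) for dim in supervisor}


# Hypothetical ratings for five enlistees on two dimensions.
supervisor = {"A": [3.0, 4.5, 5.0, 6.0, 4.0], "B": [5.0, 4.0, 4.5, 6.5, 3.5]}
peer = {"A": [3.5, 4.0, 5.5, 5.5, 4.5], "B": [4.5, 4.5, 5.0, 6.0, 4.0]}

diagonal = convergent_correlations(supervisor, peer)
median_agreement = median(diagonal.values())
```

Each entry of `diagonal` corresponds to one diagonal cell of the
supervisor-peer matrix, and `median_agreement` corresponds to the median
value reported for the MOS.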

[Table: Cannon Crewman (13B) MOS-Specific BARS - Supervisors and Peers]

Motor  Transport  Operator  -  64C 


A  total  of  155  enlistees  from  the  Motor  Transport  Operator  position  parti¬ 
cipated  in  the  field  test  sessions.  Means,  standard  deviations,  range  of 
scores  and  interrater  reliability  estimates  are  presented  in  Table  22  for 
supervisor  and  peer  ratings. 

We  gathered  supervisor  ratings  on  all  performance  dimensions  for  138  of 
these enlistees. Across all dimensions, supervisor ratings adjusted for
level differences range from 0.49 to 7.94. Mean adjusted scores range
from 4.16 to 5.11 (standard deviations range from 0.92 to 1.12). The grand
mean  computed  across  all  enlistees  and  performance  dimensions,  for  the 
adjusted  ratings,  is  5.07  (SD=0.73);  the  grand  mean  computed  for  unad¬ 
justed  ratings  is  4.92  (SD=1.02).  Interrater  reliability  estimates  range 
from .47 (F. Parking and securing vehicles) to .66 (I. Safety-mindedness
and  E.  Loading  cargo  and  transporting  personnel)  with  a  median  value 
of  .57. 

The  peer  rating  data  indicate  that  we  obtained  complete  data  for  152  en¬ 
listees.  Adjusted  scores  range  from  0.17  to  8.49.  Mean  adjusted  ratings 
for  individual  performance  dimensions  range  from  3.78  to  5.39  (standard 
deviations  range  from  0.75  to  1.09).  The  grand  mean  computed  for  adjusted 
ratings  provided  for  all  enlistees  across  all  performance  dimensions  is 
4.74  (SD=0.66);  for  unadjusted  ratings  the  grand  mean  is  4.66  (SD=0.83). 
Interrater reliability estimates range from .32 (G. Performing administrative
duties) to .68 (D. Using maps/following proper routes) with a median
value  of  .54. 

The supervisor and peer intercorrelation matrix appears in Table 23.
Correlations computed for supervisor ratings alone (excluding Overall
performance) range from .21 to .65 with a mean of .48 (SD=0.12). Correlations
between the final dimension, "Overall," and the other performance
dimensions indicate that supervisors placed the highest value on "Loading
cargo and transporting personnel" (Dimension E), "Safety-mindedness"
(Dimension I), and "Checking and maintaining vehicles" (Dimension C).

Correlations  computed  between  performance  dimension  ratings  provided  by 
peers  (excluding  Overall  performance)  range  from  .09  to  .69  with  a  mean 
of  .42  (SD=0.16).  For  the  peer  group,  "Driving  vehicles"  (Dimension  A), 
"Safety-mindedness"  (Dimension  I),  and  "Checking  and  maintaining  vehicles" 
(Dimension  C)  correlate  highest  with  the  Overall  performance  rating. 

Intercorrelations  between  supervisor  and  peer  ratings  (excluding  the  Over¬ 
all  rating)  range  from  .06  to  .54.  The  level  of  agreement  between  super¬ 
visor  and  peer  ratings  is  apparent  from  the  11  correlations  highlighted  in 
the  diagonal  of  the  matrix.  These  values  range  from  .20  (J.  Performing 
dispatcher  duties)  to  .53  (C.  Checking  and  maintaining  vehicles)  with  a 
median  of  .46. 
Administrative Specialist - 71L

A total of 129 first-termers from the Administrative Specialist MOS
participated in the field test. Table 24 contains performance dimension means,
standard  deviations,  range  of  scores,  and  interrater  reliability  estimates 
for  ratings  provided  by  supervisors  and  peers. 

Results  from  the  supervisor  ratings  indicate  that  we  obtained  complete  data 
for only 95 enlistees. This shortfall reflects circumstances unique to
this MOS. First, enlistees in this MOS often work alone with
only  one  NCO,  officer,  or  civilian  providing  daily  or  routine  supervision; 
it  was  difficult  to  locate  two  supervisors  for  each  enlistee.  Second, 
enlistees  performing  as  Administrative  Specialists  perform  some  but  not  all 
duties  delegated  to  this  MOS;  thus,  raters  simply  could  not  rate  enlistees 
on  all  dimensions.  For  this  MOS,  then,  we  generally  obtained  enlistee 
performance  ratings  from  only  one  supervisor.  Only  on  rare  occasions  were 
we  able  to  obtain  two  such  ratings  for  an  enlistee.  (Table  19  indicates 
that  the  ratio  of  raters  to  ratees  is  1.04.)  Therefore,  we  did  not  calcu¬ 
late  interrater  reliability  estimates  for  supervisor  data. 

Results  from  Table  24  indicate  that  values  for  supervisor  ratings  ranged 
from  1.00  to  8.03.  Mean  adjusted  scores  range  from  4.11  to  5.26  (standard 
deviations  range  from  1.13  to  1.44).  The  grand  mean  computed  across  all 
enlistees  and  performance  dimensions,  using  adjusted  ratings,  is  4.52 
(SD=0.94);  the  grand  mean  for  unadjusted  ratings  is  4.56  (SD=1.13). 

Data  for  peer  ratings  indicate  that  we  had  similar  problems  obtaining 
complete  rating  data,  because  soldiers  in  this  MOS  seldom  work  closely  with 
peers. Thus, we obtained complete data for only 63 enlistees, but we did
collect a sufficient number of ratings to estimate reliabilities for peer
rating data. (Table 19 indicates that we obtained 1.89 peer ratings for
each  enlistee.) 

Adjusted  peer  ratings  range  from  1.56  to  7.31.  Mean  adjusted  performance 
dimension  ratings  range  from  4.32  to  5.48  (standard  deviations  range  from 
0.81  to  1.09).  The  grand  mean  computed  across  all  enlistees  and  perfor¬ 
mance  dimensions,  using  adjusted  ratings,  is  4.72  (SD=0.64);  the  grand  mean 
computed  for  unadjusted  ratings  is  4.75  (SD=0.81).  Interrater  reliability 
estimates  range  from  .37  (H.  Providing  customer  service)  to  .55  (G.  Safe¬ 
guarding  and  monitoring  security  of  classified  documents,  and  I.  Overall 
performance)  with  a  median  value  of  .49. 

The  intercorrelation  matrix  for  supervisor  and  peer  ratings  is  provided  in 
Table 25. For supervisors alone, correlations between the first eight
performance  dimensions  (excluding  Overall)  range  from  .15  to  .66  with  a 
mean  of  .42  (SD=0.14).  According  to  the  supervisors,  "Preparing,  typing, 
and  proofreading  documents"  (Dimension  A),  "Distributing  and  dispatching 
incoming and outgoing documents" (Dimension B), and "Providing customer
service"  (Dimension  H)  correlate  highest  with  Overall  performance. 

For  peers  alone,  correlations  between  performance  dimension  ratings  (ex¬ 
cluding  Overall  performance)  range  from  .17  to  .62  with  the  mean  equal 
to  .36  (SD=0.11).  According  to  the  peer  ratings,  "Providing  customer 
service" (Dimension H), "Keeping records" (Dimension F), and "Preparing,
typing  and  proofreading  documents"  (Dimension  A)  correlate  highest  with 
Overall  performance. 

Intercorrelations  between  supervisor  and  peer  ratings  (excluding  correla¬ 
tions  with  the  overall  rating)  range  from  .03  to  .54.  The  10  correlations 
computed  between  supervisor  and  peer  ratings  on  common  performance  dimen¬ 
sions  range  from  .22  (F.  Keeping  records,  and  G.  Safeguarding  and  monitor¬ 
ing  security  of  classified  documents)  to  .51  (I.  Overall  performance)  with 
a  median  value  of  .40. 


[Table: Means, Standard Deviations, Ranges, and Reliability Estimates]


Military  Police  -  95B 

We  tested  114  Military  Police  enlistees  in  the  field  test  sessions. 

Table  26  contains  performance  dimension  rating  statistics  for  supervisor 
and  peer  ratings.  Note  that  for  both  sets  of  data,  we  obtained  complete 
data for nearly all subjects (N=111).

Adjusted  ratings  provided  by  supervisors  range  from  1.59  to  7.19.  The 
adjusted  means  computed  for  the  eight  performance  dimensions  range  from 
4.12  to  4.77.  Adjusted  standard  deviations  for  the  mean  ratings  range  from 
0.82  to  1.03.  The  grand  mean  computed  using  the  adjusted  ratings  is  4.47 
(SD=  0.63);  for  unadjusted  ratings  the  grand  mean  is  4.59  (SD=  0.75). 
Interrater reliability estimates range from .39 (B. Providing security)
to  .74  (H.  Overall  performance)  with  a  median  value  of  .55. 

Peer  ratings,  adjusted  for  level  differences,  range  from  1.88  to  7.19. 
Adjusted  mean  values  computed  for  each  performance  dimension  range  from 
4.19  to  4.75  and  the  standard  deviations  range  from  0.63  to  0.87.  The 
grand  mean  computed  across  all  enlistees  and  all  performance  dimensions, 
using  adjusted  ratings,  is  4.43  (SD=  0.60);  the  grand  mean  computed  for 
unadjusted  ratings  is  4.43  (SD=  0.66).  Interrater  reliability  estimates 
range  from  .39  (B.  Providing  security)  to  .71  (H.  Overall  performance)  with 
a  median  value  of  .65. 

Table  27  contains  the  intercorrelations  for  supervisor  and  peer  ratings. 

For  supervisors  alone,  these  correlations  for  the  seven  performance  dimen¬ 
sions  (excluding  Overall)  range  from  .20  to  .61  with  a  mean  of  .39 
(SD=  0.15).  According  to  supervisors,  "Investigating  crimes/making  ar¬ 
rests"  (Dimension  C),  "Providing  security"  (Dimension  B),  and  "Traffic 
control  and  enforcement"  (Dimension  A)  correlate  highest  with  "Overall 
performance." 

Correlations  between  dimension  ratings  (excluding  Overall)  provided  by 
peers  range  from  .48  to  .72  with  a  mean  of  .58  (SD=  0.07).  According  to 
peers,  "Traffic  control  and  enforcement"  (Dimension  A),  "Patrolling"  (Di¬ 
mension  D),  and  "Promoting  the  public  image  of  the  Military  Police"  (Dimen¬ 
sion  E)  correlate  highest  with  Overall  performance. 

Intercorrelations  between  supervisor  ratings  and  peer  ratings  (excluding 
Overall  performance)  range  from  .24  to  .54.  Correlations  computed  between 
peer  and  supervisor  ratings  on  common  performance  dimensions  range  from  .31 
(G.  Responding  to  medical  emergencies)  to  .55  (H.  Overall  performance)  with 
a  median  value  of  .45. 
Infantryman - 11B

A  total  of  178  enlistees  from  the  Infantryman  MOS  attended  the  field  test 
sessions.  Table  28  contains  the  means,  standard  deviations,  range  of 
ratings,  and  interrater  reliability  estimates  for  supervisors  and  peers. 
Please  note  that  for  this  and  the  remaining  MOS,  we  computed  adjusted 
ratings  to  remove  level  differences  among  raters.  These  ratings  were 
truncated  so  that  the  range  of  adjusted  scores  is  equivalent  to  the  range 
of  raw  or  unadjusted  scores. 
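
As an illustration of the adjustment and truncation just described, the
following sketch shows one simple way to remove rater level differences. It
assumes that each rater's ratings are shifted so that the rater's own mean
equals the grand mean, and that the rating scale runs from 1 to 7. The exact
procedure used in the field test may differ, and the rater and ratee
identifiers are hypothetical.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1.0, 7.0  # assumed 7-point rating scale


def adjust_for_level(ratings):
    """Adjust ratings on one dimension for rater level differences.

    ratings is a list of (rater, ratee, value) tuples. Each rater's values
    are shifted so the rater's mean equals the grand mean of all values, then
    averaged per ratee and truncated to the raw scale range.
    """
    grand = mean(v for _, _, v in ratings)
    by_rater = {}
    for rater, _, v in ratings:
        by_rater.setdefault(rater, []).append(v)
    offset = {r: mean(vs) - grand for r, vs in by_rater.items()}

    by_ratee = {}
    for rater, ratee, v in ratings:
        by_ratee.setdefault(ratee, []).append(v - offset[rater])
    return {
        ratee: min(SCALE_MAX, max(SCALE_MIN, mean(vs)))
        for ratee, vs in by_ratee.items()
    }


# Hypothetical data: rater "s1" rates higher on average than rater "s2".
ratings = [
    ("s1", "e1", 7.0), ("s1", "e2", 6.0),
    ("s2", "e1", 5.0), ("s2", "e2", 2.0), ("s2", "e3", 7.0),
]
adjusted = adjust_for_level(ratings)
```

In this example the untruncated adjusted value for "e3" would exceed 7, so
it is cut back to the top of the scale, which is the behavior the truncation
step is meant to produce.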

The  data  in  Table  28  indicate  that  we  obtained  one  or  more  supervisor 
ratings  for  148  enlistees.  Adjusted  ratings  provided  by  supervisors  range 
from  1.22  to  7.00.  Mean  adjusted  values  computed  across  all  ratees  for 
each  performance  dimension  range  from  4.00  to  4.77  (standard  deviations 
range  from  0.85  to  1.10).  The  grand  mean  computed  across  all  enlistees  and 
performance  dimensions  for  adjusted  ratings  is  4.45  (SD=  0.70);  the  grand 
mean for unadjusted ratings is 4.39 (SD= 0.91). Interrater reliability
estimates  computed  for  each  performance  dimension  range  from  .29  (L.  Pris¬ 
oners  of  war)  to  .63  (A.  Maintaining  supplies,  equipment,  and  weapons)  with 
a  median  value  of  .53. 

For  peer  ratings,  we  obtained  complete  data  for  172  enlistees.  Adjusted 
ratings  provided  by  peers  range  from  1.76  to  7.00.  Mean  adjusted  values 
computed  across  ratees  for  each  performance  dimension  range  from  4.22  to 
4.80;  standard  deviations  range  from  0.74  to  0.98.  The  grand  mean  computed 
across  all  enlistees  and  performance  dimensions,  using  adjusted  ratings,  is 
4.51  (SD=  0.62);  the  grand  mean  for  unadjusted  ratings  is  4.56  (SD=  0.70). 
Interrater  reliability  estimates  range  from  .30  (G.  Avoiding  enemy  detec¬ 
tion)  to  .64  (C.  Navigation)  with  a  median  value  of  .55. 

Intercorrelations  among  supervisor  and  peer  ratings  appear  in  Table  29. 

For  supervisors  alone,  correlations  between  dimensions  (excluding  Overall 
performance)  range  from  .19  to  .65  with  a  mean  of  .42  (SD=  0.10).  Ac¬ 
cording  to  the  supervisors,  "Maintaining  supplies,  equipment,  and  weapons" 
(Dimension  A),  "Assisting  and  leading  others"  (Dimension  B),  and  "Recon¬ 
naissance  and  patrol"  (Dimension  I)  correlate  highest  with  "Overall  perfor¬ 
mance." 

For  peer  ratings  alone,  correlations  for  the  first  12  dimensions  (excluding 
Overall  performance)  range  from  .29  to  .63  with  a  mean  value  of  .50 
(SD=  0.08).  According  to  the  peer  raters,  "Use  of  weapons  and  other  equip¬ 
ment"  (Dimension  D),  "Reconnaissance  and  patrol"  (Dimension  I),  and  "Navi¬ 
gation"  (Dimension  C)  correlate  highest  with  "Overall  performance." 

Intercorrelations  computed  between  supervisor  and  peer  ratings  (excluding 
Overall  performance)  range  from  .11  to  .52.  Correlations  computed  for  peer 
and  supervisor  ratings  on  common  performance  dimensions  range  from  .29  (L. 
Prisoners  of  war)  to  .51  (M.  Overall  performance)  with  a  median  value 
of  .41. 
Armor Crewman - 19E


We  tested  172  Armor  Crewman  enlistees  during  the  Batch  B  field  test  ses¬ 
sions.  Table  30  presents,  for  supervisor  and  peer  ratings,  means,  standard 
deviations,  range  of  ratings,  and  interrater  reliability  estimates. 

We  obtained  complete  supervisor  rating  data  for  146  of  these  enlistees. 
Adjusted  supervisor  ratings  range  from  1.15  to  7.00.  Mean  adjusted  ratings 
computed  separately  for  each  performance  dimension  range  from  4.35  to  5.23 
(standard  deviations  range  from  0.72  to  1.15).  The  grand  mean  computed 
across  all  enlistees  and  performance  dimensions,  using  the  adjusted  rat¬ 
ings,  is  4.75  (SD=  0.58);  for  unadjusted  ratings  the  grand  mean  is  4.89 
(SD=  0.78).  Interrater  reliability  estimates  computed  for  each  performance 
dimension  range  from  .46  (E.  Maintaining  guns)  to  .73  (F.  Engaging  targets 
with  tank  guns)  with  a  median  value  of  .57. 

We  obtained  complete  peer  rating  data  for  163  Armor  Crewman  enlistees.  The 
adjusted  values  range  from  1.45  to  7.00.  Mean  adjusted  values  computed  for 
each  performance  dimension  range  from  4.38  to  5.01  with  the  standard  devia¬ 
tions  ranging  from  0.71  to  0.98.  The  grand  mean  computed  across  all  en¬ 
listees  and  performance  dimensions,  using  the  adjusted  ratings,  is  4.76 
(SD=  0.56);  the  grand  mean  computed  using  unadjusted  ratings  is  4.75  (SD= 
0.60).  Interrater  reliability  estimates  range  from  .29  (C.  Stowing  ammuni¬ 
tion  aboard  tanks)  to  .65  (I.  Overall  performance)  with  a  median  value 
of  .43. 

Table  31  presents  the  intercorrelations  for  supervisor  and  peer  ratings. 

For  supervisor  ratings  alone,  correlations  for  the  first  eight  performance 
dimensions  (excluding  Overall  performance)  range  from  .09  to  .47  with  a 
mean  value  of  .29  (SD=  0.11).  According  to  supervisors,  "Preparing  tanks 
for  field  problems"  (Dimension  H),  "Maintaining  tank,  tank  systems,  and 
associated  equipment"  (Dimension  A),  and  "Engaging  targets  with  tank  guns" 
(Dimension  F)  correlate  highest  with  "Overall  performance." 

Correlations  between  performance  dimension  ratings  provided  by  peers  (ex¬ 
cluding  Overall  performance)  range  from  .06  to  .51,  with  a  mean  value 
of  .35  (SD=  0.13).  According  to  peers,  "Preparing  tanks  for  field  prob¬ 
lems"  (Dimension  H),  "Engaging  targets  with  tank  guns"  (Dimension  F),  and 
"Stowing  ammunition  aboard  tanks"  (Dimension  C)  correlate  highest  with 
"Overall  performance." 

Intercorrelations  between  peer  and  supervisor  ratings  computed  for  the 
first  eight  performance  dimensions  (excluding  Overall  performance)  range 
from .02 to .42. Correlations appearing in the diagonal of this matrix
range from .14 (C. Stowing ammunition aboard tanks) to .42 (F. Engaging
targets  with  tank  guns)  with  a  median  value  of  .30. 
Radio Teletype Operator - 31C


In  the  field  test  sessions,  we  assessed  the  performance  of  148  Radio  Tele¬ 
type  Operator  first-term  enlistees.  Means,  standard  deviations,  range  of 
ratings, and interrater reliability estimates are presented in Table 32.

According  to  the  information  in  this  table,  we  obtained  complete  supervisor 
rating  data  for  125  of  those  enlistees.  Mean  adjusted  values  computed 
across  all  enlistees  for  each  performance  dimension  range  from  4.26  to  4.93 
(the  standard  deviation  for  these  scores  ranges  from  1.01  to  1.16).  The 
grand  mean  computed  across  all  enlistees  and  performance  dimensions,  using 
adjusted ratings, is 4.68 (SD= 0.86); the grand mean for unadjusted ratings
is  4.46  (SD=  0.93).  Interrater  reliability  estimates  range  from  .57  (C. 
Operating  communications  devices)  to  .70  (G.  Overall  performance)  with  a 
median  value  of  .63. 

From  peers  we  obtained  complete  rating  data  for  120  Radio  Teletype  Operator 
enlistees.  Mean  adjusted  values  computed  for  each  performance  dimension 
range  from  4.38  to  4.91  (standard  deviations  range  from  0.85  to  1.03).  The 
grand  mean  computed  for  adjusted  ratings  is  4.66  (SD=  0.69);  the  grand  mean 
computed  using  unadjusted  ratings  is  4.88  (SD=  0.86).  Interrater  reliabil¬ 
ity  estimates  range  from  .52  (A.  Inspecting  and  servicing  equipment)  to  .69 
(G.  Overall  performance)  with  a  median  value  of  .60. 

Correlations  computed  between  performance  dimension  ratings  provided  by 
supervisors  and  peers  are  shown  in  Table  33.  For  supervisors  alone,  these 
values range from .46 to .65 with a mean of .53 (SD= 0.05). (Values for
the  Overall  rating  are  not  included  in  the  range  or  mean  values  above.) 
According  to  supervisors,  "Installing  and  repairing  equipment"  (Dimension 
B),  "Inspecting  and  servicing  equipment"  (Dimension  A),  and  "Providing  safe 
transportation"  (Dimension  F)  are  the  dimensions  most  highly  correlated 
with  "Overall  performance." 

An examination of the peer data indicates that the correlations between the
first  seven  performance  dimensions  (excluding  Overall)  range  from  .37 
to  .66  with  a  mean  of  .49  (SD=  0.09).  According  to  peers,  "Overall  perfor¬ 
mance"  correlates  highest  with  performance  in  "Installing  and  repairing 
equipment"  (Dimension  B),  "Operating  communications  devices"  (Dimension  C), 
and  "Inspecting  and  servicing  equipment"  (Dimension  A). 

Intercorrelations  computed  between  performance  dimension  ratings  provided 
by  peers  and  by  supervisors  (excluding  Overall  performance)  range  from  .21 
to  .54.  Correlations  between  supervisor  and  peer  ratings  on  common  perfor¬ 
mance  dimensions  range  from  .21  (C.  Operating  communications  devices) 
to  .63  (G.  Overall  performance)  with  a  median  value  of  .43. 
Light-Wheel Vehicle Mechanic - 63B

A  total  of  156  Light-Wheel  Vehicle  Mechanic  enlistees  were  tested  in  the 
field  test  sessions.  Data  for  these  sessions  are  summarized  in  Table  34. 


We  obtained  complete  supervisor  rating  data  for  137  of  these  enlistees. 

Mean  adjusted  scores  computed  across  all  enlistees  for  each  performance 
dimension  range  from  3.96  to  4.92  (standard  deviations  for  these  ratings 
range  from  1.03  to  1.23).  The  grand  mean  computed  across  all  enlistees  and 
performance  dimensions,  for  adjusted  ratings,  is  4.48  (SD=  0.87);  the  grand 
mean  computed  for  unadjusted  ratings  is  4.34  (SD=  0.98).  Estimates  of 
interrater  reliability  range  from  .43  (C.  Performing  routine  maintenance) 
to  .67  (L.  Overall  performance)  with  a  median  value  of  .62. 

From  peers  we  obtained  complete  data  for  a  total  of  127  Light-Wheel  Vehicle 
Mechanic  enlistees.  Mean  adjusted  values  computed  for  each  performance 
dimension  range  from  4.11  to  4.92  (standard  deviations  range  from  0.94  to 
1.12).  The  grand  mean  computed  for  adjusted  ratings  is  4.47  (SD=  0.73); 
using  unadjusted  ratings  the  grand  mean  is  4.64  (SD=  0.81).  Interrater 
reliability  estimates  range  from  .35  (K.  Recovery)  to  .70  (C.  Performing 
routine  maintenance)  with  a  median  value  of  .59. 

Table 35 contains the intercorrelations computed between performance
dimension ratings for supervisors and peers. For supervisors, correlations
among the first 11 performance dimensions (excluding Overall performance)
range from .31 to .77 with a mean of .53 (SD= 0.10). Performance dimension
ratings yielding the highest correlations with "Overall performance" for the
supervisor  group  include  "Troubleshooting"  (Dimension  B),  "Performing  rou¬ 
tine  maintenance"  (Dimension  C),  "Inspecting,  testing,  and  detecting  prob¬ 
lems  with  equipment"  (Dimension  A),  and  "Repair"  (Dimension  D). 

Correlations  between  performance  dimension  ratings  provided  by  peers  (ex¬ 
cluding  Overall  performance)  range  from  .08  to  .69  with  a  mean  value  of  .43 
(SD=  0.13).  Peers  agree  with  supervisors  that  "Repair"  (Dimension  D), 
"Troubleshooting"  (Dimension  B),  and  "Inspecting,  testing,  and  detecting 
problems with equipment" (Dimension A) correlate highest with "Overall
performance."

Intercorrelations  between  performance  dimension  ratings  provided  by  super¬ 
visors  and  peers  (excluding  Overall)  range  from  .06  to  .57.  Correlations 
in the diagonal of the supervisor-peer matrix range from .26 (K. Recovery)
to  .62  (L.  Overall  performance)  with  a  median  value  of  .45. 
Medical Specialist - 91A

A total of 167 Medical Specialist enlistees were included in the field test
sessions.  Data  for  this  MOS  are  summarized  in  Table  36. 

As Table 36 indicates, we obtained complete supervisor rating data for 138
of  these  enlistees.  Adjusted  mean  scores  computed  across  all  enlistees  for 
each  performance  dimension  range  from  4.39  to  5.17  (standard  deviations  for 
these  values  range  from  0.97  to  1.24).  The  grand  mean  computed  across  all 
enlistees  and  performance  dimensions,  using  adjusted  ratings,  is  4.71  (SD= 
0.79);  for  unadjusted  ratings  the  grand  mean  is  4.71  (SD=  0.83).  Inter¬ 
rater  reliability  estimates  range  from  .45  (G.  Providing  routine  and  on¬ 
going  patient  care)  to  .75  (C.  Keeping  medical  records)  with  a  median  value 
of  .66. 

We  obtained  complete  peer  rating  data  for  148  Medical  Specialists.  Ad¬ 
justed  mean  values  computed  across  all  enlistees  for  each  performance 
dimension  range  from  4.45  to  4.93  (standard  deviations  range  from  0.84  to 
1.03).  The  grand  mean  computed  using  the  adjusted  ratings  is  4.71  (SD= 
0.72);  across  the  unadjusted  ratings  the  grand  mean  is  4.72  (SD=  0.76). 
Interrater  reliability  estimates  computed  for  peers  range  from  .44  (F. 
Preparing  and  inspecting  field  site  or  clinic  facilities)  to  .68  (I.  Pro¬ 
viding  health  care  and  health  maintenance  instructions  to  Army  personnel) 
with  a  median  value  of  .62. 

Correlations  between  performance  dimension  ratings  provided  by  supervisors 
and  peers  are  provided  in  Table  37.  For  supervisors  alone,  values  for  the 
first  nine  dimensions  (excluding  Overall  performance)  range  from  .25  to  .57 
with a mean value of .45 (SD= 0.08). According to supervisors, "Responding
to  emergency  situations"  (Dimension  H),  "Keeping  medical  records"  (Dimen¬ 
sion  C),  and  "Maintaining  accountability  of  medical  supplies  and  equipment" 
(Dimension  B)  correlate  highest  with  "Overall  performance." 

Focusing  on  peer  rating  data,  correlations  between  ratings  on  the  first 
nine  performance  dimensions  (excluding  Overall  performance)  range  from  .33 
to  .70  with  a  mean  value  of  .53  (SD=  0.09).  According  to  peers,  "Re¬ 
sponding  to  emergency  situations"  (Dimension  H),  "Dispensing  medication" 
(Dimension  E),  and  "Providing  routine  and  ongoing  patient  care"  (Dimension 
G)  correlate  highest  with  "Overall  performance." 

Intercorrelations  among  supervisor  and  peer  ratings  across  all  performance 
dimensions,  excluding  "Overall  performance",  range  from  .18  to  .57.  Cor¬ 
relations  computed  between  supervisor  and  peer  ratings  on  common  perfor¬ 
mance  dimensions  range  from  .29  (F.  Preparing  and  inspecting  field  site  or 
clinic  facilities)  to  .59  (J.  Overall  performance)  with  a  median  value 
of  .43. 
Analyses of the field test data indicate that peers and supervisors pro¬
vided  useful  information  about  MOS-specific  job  performance,  with  each 
rater  group  providing  unique  information  about  MOS-specific  job  require¬ 
ments. 

Supervisor  and  peer  ratings  yielded  similar  levels  of  reliability  esti¬ 
mates.  Across  all  MOS,  median  reliability  estimates  for  supervisor  ratings 
range from .53 for Infantryman (11B) to .66 for Medical Specialist (91A)
with  a  median  value  of  .57.  For  peer  ratings,  median  values  range  from  .43 
for  Armor  Crewman  (19E)  to  .65  for  Military  Police  (95B)  with  a  median 
value  of  .55.  The  median  values  indicate  that  for  single  item  scales, 
interrater  reliability  estimates  are  at  acceptable  levels.  Median  values 
for  the  two  rater  type  groups  suggest  that  supervisors  are  probably  more 
reliable  than  peers.  Recall  that  assumptions  for  computing  interrater 
reliability  estimates  differed  for  supervisors  and  peers;  we  assumed  three 
or  four  peer  raters  for  each  ratee  and  two  supervisor  raters  for  each 
ratee.  Reported  reliability  estimates  were  adjusted  for  the  number  of 
raters  for  each  ratee.  Given  equal  numbers  of  supervisor  and  peer  raters 
for  each  ratee,  these  data  indicate  that  the  supervisor  ratings  would  be 
somewhat  more  reliable  than  the  peer  ratings. 
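
The effect of the number of raters can be illustrated with the
Spearman-Brown formula. The sketch below assumes that the reported medians
are reliabilities of the mean of k raters (two supervisors, three peers) and
that the Spearman-Brown relationship applies. Neither assumption is confirmed
here, so the resulting numbers are illustrative only.

```python
def single_rater_reliability(r_k, k):
    """Step a k-rater reliability down to one rater (Spearman-Brown)."""
    return r_k / (k - (k - 1) * r_k)


def k_rater_reliability(r_1, k):
    """Step a single-rater reliability up to the mean of k raters."""
    return k * r_1 / (1 + (k - 1) * r_1)


# Median estimates from the text, with the assumed rater counts.
sup_1 = single_rater_reliability(0.57, 2)   # assumed two supervisors per ratee
peer_1 = single_rater_reliability(0.55, 3)  # assumed three peers per ratee

# Put both sources on the same footing: two raters each.
sup_2 = k_rater_reliability(sup_1, 2)
peer_2 = k_rater_reliability(peer_1, 2)
```

Under these assumptions, with two raters per source the supervisor estimate
stays at .57 while the peer estimate falls to about .45, which is in the
direction of the conclusion stated above.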

Supervisors  and  peers  provided  similar  information  about  the  mean  level  of 
performance.  Across  the  nine  MOS,  peers  provided  slightly  higher  grand 
mean  values  than  supervisors  in  two  MOS,  Administrative  Specialist  (71L) 
and Infantryman (11B). Supervisors provided slightly higher grand mean
values  than  peers  in  two  MOS,  Motor  Transport  Operator  (64C)  and  Military 
Police  (95B).  Mean  ratings  for  the  two  groups  were  nearly  identical  for 
the  remaining  MOS,  Cannon  Crewman  (13B),  Armor  Crewman  (19E),  Radio  Tele¬ 
type  Operator  (31C),  Light-Wheel  Vehicle  Mechanic  (63B),  and  Medical  Spe¬ 
cialist  (91A). 

Average  intercorrelations  among  performance  dimension  ratings  for  super¬ 
visors  and  peers  are  similar.  For  supervisor  ratings,  the  mean  correlation 
for  the  nine  MOS  ranges  from  .29  for  Armor  Crewman  (19E)  to  .53  for  Radio 
Teletype  Operator  (31C)  and  Light-Wheel  Vehicle  Mechanic  (63B).  For  peer 
ratings,  the  mean  correlation  across  the  nine  MOS  ranges  from  .35  for  Armor 
Crewman  (19E)  to  .58  for  Military  Police  (95B).  The  greatest  difference 
between  mean  correlations  for  supervisors  and  peers  occurs  for  Military 
Police  (95B)  with  the  mean  value  for  supervisors  at  .39  and  mean  value  for 
peers  at  .58. 
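
The mean intercorrelation reported above can be sketched as the average of the correlations over every distinct pair of performance dimensions. The example below is illustrative only: the helper names and the ratings are hypothetical, and a simple unweighted average of Pearson correlations is assumed.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_intercorrelation(ratings_by_dimension):
    """Average the correlations over every distinct pair of dimensions."""
    pairs = list(combinations(ratings_by_dimension.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical ratings (1-7 scale) of five ratees on three dimensions.
supervisor_ratings = {
    "A": [5, 4, 6, 3, 5],
    "B": [4, 4, 6, 2, 5],
    "C": [6, 3, 5, 4, 4],
}

print(round(mean_intercorrelation(supervisor_ratings), 2))
```

A higher average suggests that raters distinguish less among dimensions, which is why this statistic is often read as an index of halo.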

For  each  MOS,  we  identified  the  three  performance  dimension  ratings  that
correlated  highest  with  the  "Overall  performance"  rating,  separately  for
supervisors  and  for  peers.  This  information  suggests  how  the  two  rater  groups
differ  with  respect  to  perceptions  about  requirements  that  lead  to  success
on  the  job.  Across  the  nine  MOS,  correlations  between  performance  dimension
ratings  and  the  "Overall  performance"  rating  indicate  that  supervisors
and  peers  agree  only  moderately  on  the  requirements  that  lead  to  success  on 
the  job. 
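
The comparison described above amounts to ranking dimensions by their correlation with the overall rating within each rater group and counting how many of the top three are shared. The sketch below uses hypothetical dimension labels and correlation values; `top_dimensions` is an illustrative helper, not a procedure taken from the report.

```python
def top_dimensions(corr_with_overall, k=3):
    """Return the k dimensions correlating highest with the overall rating."""
    ranked = sorted(corr_with_overall, key=corr_with_overall.get, reverse=True)
    return set(ranked[:k])


# Hypothetical correlations with "Overall performance" for one MOS.
supervisor_corr = {"A": 0.62, "B": 0.48, "C": 0.71, "D": 0.55, "E": 0.40}
peer_corr = {"A": 0.66, "B": 0.59, "C": 0.52, "D": 0.63, "E": 0.45}

shared = top_dimensions(supervisor_corr) & top_dimensions(peer_corr)
print(sorted(shared), len(shared))  # ['A', 'D'] 2
```

In this made-up case the two groups agree on two of the three dimensions most closely tied to overall performance.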

For  four  MOS,  Administrative  Specialist  (71L),  Armor  Crewman  (19E),  Radio
Teletype  Operator  (31C),  and  Light-Wheel  Vehicle  Mechanic  (63B),  peers  and 
supervisors  agreed  on  two  of  the  three  performance  dimensions  contributing 
most  to  overall  performance.  For  three  MOS,  Cannon  Crewman  (13B),  Military 
Police  (95B),  and  Medical  Specialist  (91A),  supervisors  and  peers  agreed  on 
one  of  three  performance  dimensions.  For  two  MOS,  Motor  Transport  Operator
(64C)  and  Infantryman  (11B),  there  was  no  agreement  between  supervisors  and
peers  concerning  the  performance  dimensions  that  correlate  highest  with
"Overall  performance."

Finally,  correlations  computed  between  supervisor  and  peer  ratings  on 
common  performance  dimensions  reveal  a  moderate  amount  of  agreement  between 
the  two  rater  groups.  Median  correlations  computed  for  each  MOS  range 
from  .30  for  Armor  Crewman  (19E)  to  .46  for  Motor  Transport  Operator
(64C).
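
The median agreement statistic above can be sketched as follows: correlate supervisor and peer ratings across ratees on each dimension the two groups share, then take the median of those correlations. The data and function names are hypothetical, and Pearson correlations on per-ratee mean ratings are assumed.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def median_cross_group_r(supervisor, peer):
    """Median supervisor-peer correlation over dimensions both groups rated."""
    common = sorted(set(supervisor) & set(peer))
    return median(pearson(supervisor[d], peer[d]) for d in common)


# Hypothetical mean ratings of five ratees per dimension (one MOS).
supervisor = {"A": [5, 4, 6, 3, 5], "B": [4, 4, 6, 2, 5], "C": [6, 3, 5, 4, 4]}
peer = {"A": [5, 5, 6, 4, 4], "B": [3, 5, 5, 3, 4], "C": [5, 4, 4, 5, 3]}

print(round(median_cross_group_r(supervisor, peer), 2))
```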

In  sum,  supervisors  and  peers  provided  performance  ratings  that  were  similar
in  reliability,  mean  performance  level,  and  average  intercorrelation
between  performance  dimensions.  Supervisors  and  peers,  however,  appeared
to  differ  somewhat  in  their  perceptions  of  requirements  that  lead  to  overall
success  on  the  job.

CHAPTER  3:  PREPARATION  OF  THE  MOS-SPECIFIC  BARS  FOR  ADMINISTRATION
IN  THE  CONCURRENT  VALIDITY  STUDY 


Prior  to  administering  the  MOS-specific  rating  scales  in  the  Concurrent 
Validity  study,  scale  developers  reviewed  results  from  the  field  test  data 
analyses.  Further,  the  MOS-specific  rating  scales  were  submitted  to  a 
Proponent  review  to  verify  that  critical  first-term  job  requirements  were 
represented  in  the  performance  scales.  In  this  chapter  we  describe  the
procedures  for  modifying  the  MOS-specific  behaviorally  anchored  rating 
scales,  using  results  from  the  field  test  as  well  as  input  supplied  by  the 
Proponent  review  committee. 


Evaluation  of  Field  Test  Results 

Reliability 

In  Chapter  2,  we  summarized  the  reliability  estimates  computed  for  supervisor
and  peer  ratings  obtained  from  the  field  test  sessions.  Although  we
concluded  that,  on  the  average,  single-scale  reliability  estimates  were
acceptable  for  each  rater  group,  we  were  concerned  that  within  a  particular
MOS  there  might  be  one  or  two  performance  dimensions  on  which  supervisors
and  peers  alike  experienced  difficulty  in  evaluating  enlistees.  Consistently
low  reliability  estimates  observed  for  both  rater  groups  on  a  particular
performance  dimension  might  suggest  that  the  dimension  definition  and
anchors  were  unclear  or  that  the  dimension  did  not  reflect  a  critical
component  of  the  job.

For  each  MOS,  we  compared  the  reliability  estimates  computed  for  performance
dimension  ratings  provided  by  supervisors  with  estimates  for  ratings
provided  by  peers  to  identify  possible  problem  dimensions.  Table  38  provides
a  summary  of  the  median  reliability  estimates  as  well  as  the  range  of
reliabilities  for  each  MOS.

For  most  MOS,  there  appears  to  be  no  consistent  pattern  when  reliability 
estimates  computed  for  supervisor  ratings  are  compared  with  those  computed 
for  peer  ratings.  In  only  one  MOS,  Military  Police  (95B),  did  the  pattern  of
reliability  estimates  for  supervisor  ratings  and  peer  ratings  correspond
closely.  Within  that  MOS  one  performance  dimension,  "Providing  security"
(Dimension  B),  appeared  to  present  problems  for  both  rater  groups.
The  interrater  reliability  estimate  for  this  dimension,  computed  separately  for
supervisors  and  peers,  is  the  same  for  both  groups,  .39.  Therefore,  we  reviewed
this  particular  performance  dimension  to  clarify  the  definition  as  well  as  the
behavioral  anchors.

For  the  remaining  MOS-specific  rating  scales,  we  identified  performance 
dimensions  with  low  reliability  estimates  computed  for  peer  or  supervisor 
ratings.  We  then  reviewed  rating  scale  definitions  and  anchors  developed 
for  these  dimensions  to  uncover  potential  problems. 


Table  38.  MOS-Specific  BARS:  Summary  of  Reliability  Estimates  for  Supervisor  and  Peer  Ratings


Leniency  and  Severity 

As  reported  in  Chapter  2,  we  computed  grand  mean  values  separately  for  peer 
ratings  and  supervisor  ratings;  for  the  two  rater  type  groups  these  mean 
values  are  very  similar.  We  used  these  values  to  assess  leniency  and 
severity  effects.  High  mean  values  indicate  that  raters  may  have  been  too 
lenient  or  "easy"  in  assigning  ratings,  whereas  very  low  mean  values  indicate
that  raters  may  have  been  too  severe  or  strict  in  assigning  ratings.

Recall  that  the  grand  mean  values  tabulated  in  Chapter  2  were  computed 
using  adjusted  ratings.  Grand  means  computed  using  the  raw  rating  data 
provide  a  more  appropriate  statistic  for  evaluating  ratings  for  leniency  or 
severity  effects.  Table  39  contains  the  grand  mean  values  reported  by  MOS 
and  by  rater  type.  Grand  mean  values  computed  using  both  the  unadjusted 
and  adjusted  ratings  have  been  included  for  comparison  purposes. 

Grand  mean  values  computed  using  adjusted  scores  correspond  very  closely
with  those  values  computed  using  unadjusted  scores.  For  supervisors  the 
grand  mean  values,  using  unadjusted  ratings,  range  from  4.34  to  4.92;  for 
adjusted  ratings  these  values  range  from  4.48  to  5.07.  For  peers  the  grand 
mean  values  for  unadjusted  ratings  range  from  4.43  to  4.89;  for  adjusted 
ratings  the  values  range  from  4.43  to  4.85.*

Since  the  scale  used  for  making  these  ratings  ranges  from  1  (low  or  ineffective
performance)  to  7  (high  or  effective  performance),  one  might  argue
that  ratings  which  reflect  no  leniency  or  severity  effects  should  be  near
4.00.  According  to  the  results  from  the  field  test,  grand  means  computed
across  individual  performance  dimensions  separately  for  each  MOS  and  rater
type  are  all  above  4.00.  One  might  conclude,  then,  that  these  data  demonstrate
leniency  effects.
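
The midpoint comparison above can be sketched as a simple check of how far a grand mean sits from the center of the 7-point scale. The example assumes the grand mean is the mean of the dimension means for one MOS and rater type; the ratings and function names are hypothetical, not field test data.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7
MIDPOINT = (SCALE_MIN + SCALE_MAX) / 2  # 4.0 on the 7-point scale


def grand_mean(ratings_by_dimension):
    """Mean of the dimension means for one MOS and rater type."""
    return mean(mean(r) for r in ratings_by_dimension.values())


def leniency_shift(ratings_by_dimension):
    """Signed distance of the grand mean from the scale midpoint.

    Positive values point toward leniency, negative toward severity.
    """
    return grand_mean(ratings_by_dimension) - MIDPOINT


# Hypothetical raw (unadjusted) ratings on three dimensions.
raw = {"A": [5, 4, 6, 5], "B": [4, 5, 5, 4], "C": [5, 5, 4, 6]}
print(round(grand_mean(raw), 2), round(leniency_shift(raw), 2))  # 4.83 0.83
```

A positive shift alone does not establish leniency; it only shows that the ratings sit above the nominal center of the scale.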


Cascio  and  Valenzi  (1978),  however,  argue  that  ratings  which  appear  lenient 
might,  in  fact,  accurately  reflect  incumbents’  job  performance,  because 
prior  selection  has  weeded  out  potentially  poor  performers.  Supervisor  and 
peer  ratings  obtained  in  the  field  test  sessions  do  not  appear  overly 
lenient  and  may,  in  fact,  reflect  job  performance  levels  we  would  expect, 
given  that  poorer  performers  have  been  identified  and  screened  out  through 
the  selection  and  classification  process  as  well  as  through  Basic  Training 
and  Advanced  Individual  Training. 


Proponent  Review  Procedures  and  Results 

Following  the  Batch  B  field  test  administration,  each  of  the  nine  MOS-specific
behaviorally  anchored  rating  scales  was  submitted  to  a  Proponent
committee  for  review.  Proponent  committee  members,  who  were  primarily 
technical  school  subject  matter  experts  from  each  MOS,  studied  the  scales 
and  made  suggestions  for  scale  modifications. 


*Unadjusted  and  unscreened  rating  data  provided  by  supervisors  and  peers
are  summarized  in  Section  5  of  the  nine  MOS  appendices. 


For  most  MOS,  suggestions  made  by  committee  members  included  minor  wording 
changes.  For  example,  committee  members  noted  a  problem  with  one  of  the 
anchors  in  one  Administrative  Specialist  (71L)  performance  dimension, 
"Keeping  records."  Specifically,  the  committee  recommended  deleting  one 
anchor  from  this  dimension  because  it  described  job  duties  typically  required
of  second-term  personnel  only  (i.e.,  handling  suspense  dates).
Therefore,  we  omitted  this  anchor  from  that  performance  dimension. 

For  another  MOS,  Radio  Teletype  Operator  (31C),  the  Proponent  review
committee  noted  that  the  job  title  had  been  changed.  Therefore,  we  made
the  necessary  changes  on  all  Concurrent  Validity  study  rating  forms.  The
current  MOS-specific  rating  form  for  this  MOS  now  reads  "Single  Channel
Radio  Operator--31C."

For  one  MOS,  Military  Police  (95B),  the  committee  asked  for  more  extensive
changes.  Committee  members  noted  that  because  critical  incident  workshops 
were  conducted  only  in  CONUS  locations,  a  few  requirements  of  the  Military 
Police  job  were  missing.  Incumbents  in  this  MOS  serving  in  OCONUS  locations
are  required  to  provide  combat  and  combat  support  functions.  Thus,
four  performance  dimensions  describing  these  requirements  were  added  to  the 
Military  Police  MOS-specific  rating  scales:  (1)  "Navigation"  (Dimension 
H);  (2)  "Avoiding  enemy  detection"  (Dimension  I);  (3)  "Use  of  weapons  and 
other  equipment"  (Dimension  J);  and  (4)  "Courage  and  proficiency  in  battle" 
(Dimension  K).  Definitions  and  behavioral  anchors  for  these  scales  had 
been  developed  for  the  Infantryman  (11B)  performance  dimension  rating
scales.  Proponent  committee  members  reviewed  these  definitions  and  anchors 
and  authorized  including  the  same  information  in  the  Military  Police  performance
rating  scales.


Project-Wide  Review  Committee


Following  the  Batch  B  field  test  sessions,  Project  A  staff  members  reviewed
the  final  set  of  rating  scales.  This  group,  the  Criterion  Measurement  Task 
Force,  was  composed  of  project  personnel  responsible  for  developing  task- 
oriented  and  behavior-oriented  criterion  measures.  Further,  most  members 
had  participated  in  administering  criterion  measures  during  the  Batch  A  and 
Batch  B  field  tests. 

Task  Force  participants  reported  that  some  of  the  rating  scales,  the  behaviorally
anchored  scales  in  particular,  required  considerable  reading
time.  Consequently,  they  believed  that  many  raters  were  not  reading  the 
scales  thoroughly  before  making  their  ratings.  This  group  recommended  that 
we  pare  down  the  length  of  the  behavioral  anchors  to  help  ensure  that  all 
raters  would  review  the  anchors  thoroughly  before  using  them  to  evaluate 
incumbents. 

Therefore,  PDRI  staff  responsible  for  developing  the  nine  MOS-specific
rating  scales  modified  the  performance  dimension  definitions  and  scale
anchors.  Their  goal  was  to  retain  the  specific  job  requirements  and  depiction
of  ineffective,  adequate,  or  effective  performance  in  each  anchor
while  eliminating  unnecessary  information  or  lengthy  descriptions.  Figure
5  contains  an  example  of  the  anchors  for  one  performance  dimension  included
in  the  Military  Police  (95B)  rating  scales  as  they  appeared  for  the  Batch  B
administration  and  as  they  appear  for  the  Concurrent  Validity  study.

The  rating  scales  to  be  administered  in  the  Concurrent  Validity  study  have 
been  included  in  Section  6  of  the  nine  MOS  appendices  to  this  report. 


Concurrent  Validity  Study  Plans 


Administration 

Throughout  field  test  data  collection  efforts,  PDRI  staff  members  conducting
rating  sessions  identified  problems  with  particular  rating  instruments
and  ways  to  improve  the  rating  sessions.  This  information  was
summarized  in  memos  to  the  various  task  leaders. 

In  sum,  rating  session  administrators  reported  few  or  no  problems  with  the 
MOS-specific  rating  scales.  The  only  complaint  with  these  particular 
scales  was  that  they  did  not  offer  a  "Cannot  Rate"  option  for  raters  who 
feel  unable  to  evaluate  an  incumbent  on  a  particular  performance  dimension. 
We  decided  that  for  the  Concurrent  Validity  study,  we  would  not  include  a 
"Cannot  Rate"  option.  Instead,  rating  session  administrators  would  be 
instructed  to  encourage  raters  to  evaluate  ratees  on  ALL  performance  dimensions.
Raters  who  simply  could  not  evaluate  a  ratee  on  a  particular  dimension
would  be  asked  to  leave  that  scale  blank.  (For  a  complete  description
of  guidelines  provided  to  rating  session  administrators  for  the  Concurrent 
Validity  study,  see  Pulakos  &  Borman,  1986.) 
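
One consequence of the blank-scale rule is that any later averaging across raters has to skip missing values. The sketch below shows one way this might be handled; it is an assumption for illustration, since the report does not state how blank scales would be treated in the analyses, and the names and data are hypothetical.

```python
def mean_rating(ratings):
    """Average the ratings a ratee received on one dimension, skipping blanks.

    Blank scales are represented as None. Returns None when every rater
    left the scale blank.
    """
    given = [r for r in ratings if r is not None]
    return sum(given) / len(given) if given else None


# Hypothetical ratings from three raters; one left Dimension B blank,
# and all three left Dimension C blank.
ratee = {"A": [5, 4, 6], "B": [4, None, 5], "C": [None, None, None]}
pooled = {dim: mean_rating(r) for dim, r in ratee.items()}
print(pooled)  # {'A': 5.0, 'B': 4.5, 'C': None}
```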

Data  Analysis 

Data  analyses  for  Batch  A  and  Batch  B  field  test  data  have  been  described 
in  Chapter  2  of  this  report.  Briefly,  this  process  entailed  computing 
adjusted  rating  scores  for  raters  using  information  from  supervisors  and 
peers  combined;  following  the  adjustment  procedures,  we  analyzed  supervisor 
and  peer  rating  data  separately. 

Data  collected  in  the  Concurrent  Validity  study,  with  a  larger  sample  size
for  each  MOS,  will  permit  additional  analyses  that  were  not  performed  on  the
field  test  data.  These  include  the  following: 

•  Compare  adjusted  scores  with  unadjusted  scores  to  determine
whether  one  procedure  is  better  than  the  other  in  terms  of  reliability,
halo,  and  rating  score  distributions.

•  Factor  analyze  intercorrelations  computed  between  performance 
dimension  ratings  provided  by  supervisors.  Compare  the  resulting 
factors  with  factors  obtained  from  the  peer  rating  data. 

•  Determine  whether  and  how  best  to  combine  the  information  supplied
by  supervisors  and  peers.

•  Examine  correlations  between  ratings  obtained  on  MOS-specific
rating  scales  and  criterion  data  obtained  on  other  measures 
(e.g.,  hands-on  tests,  job  knowledge  tests,  training  knowledge 
tests).  This  information  would  provide  a  clearer  understanding 
of  the  job  performance  components  that  we  are  capturing  in  the 
MOS-specific  BARS.  Further,  these  data  would  be  useful  in  developing
criterion  composite  measures.


Summary 

In  this  chapter,  we  described  the  information  used  to  modify  the  MOS- 
specific  behaviorally  anchored  rating  scales  developed  for  nine  MOS,  prior 
to  their  use  in  the  Concurrent  Validity  study.  Briefly,  we  relied  on 
information  obtained  from  field  test  administrations,  recommendations  provided
by  subject  matter  experts,  and  suggestions  offered  by  project  staff.

In  general,  very  few  content  changes  were  made  on  the  rating  scales,  with 
the  exception  of  additional  scales  developed  for  Military  Police  (95B)  to 
reflect  overseas  requirements.  Across  all  MOS-specific  rating  scales, 
however,  we  pruned  the  behavioral  anchors  to  reduce  reading  requirements 
while  maintaining  the  flavor  and  standards  depicted  in  each  anchor.